Chapter 1: Introduction¶
1.1 What Operating Systems Do¶
This section introduces the fundamental role of an operating system (OS) within a computer system. Think of the OS as the core manager that makes everything else possible.
The Four Components of a Computer System¶
A computer system can be broken down into four main parts:
- Hardware: The physical components you can touch. This includes the CPU (Central Processing Unit) for computation, memory (RAM) for temporary data storage, and I/O (Input/Output) devices like keyboards, mice, disks, and monitors. These are the basic computing resources.
- Operating System: The crucial software that controls the hardware and acts as a coordinator.
- Application Programs: The software you use to get things done, like word processors (Microsoft Word), web browsers (Google Chrome), compilers (GCC), and games. These use the resources provided by the hardware.
- Users: The people who interact with the application programs.
(Refer to Figure 1.1: Abstract view of the components of a computer system)
The figure shows a layered view: the user interacts with the application programs, which rely on the operating system, which directly controls the computer hardware.
The Operating System as a Government¶
A helpful analogy is to think of the operating system as a government. A government itself doesn't build houses or grow food. Instead, it provides a structure, rules, and services (like roads and laws) that allow its citizens (the application programs) to work productively and coexist peacefully. Similarly, the OS doesn't do useful work directly for the user; it creates an environment where other programs can do useful work efficiently and without interfering with each other.
To fully understand the OS, we can look at it from two different perspectives.
1.1.1 User View¶
How you see the operating system depends heavily on the type of computer you're using. The OS is designed differently for different user experiences.
Desktop and Laptop Users: If you're using a PC or Mac, you are the sole user of the machine. The operating system's primary goal here is ease of use. It wants to help you maximize your work or play. Performance (speed) and security (protecting your data) are also important, but resource utilization—how efficiently the hardware resources are shared—is not a top priority because you aren't sharing them with anyone else.
Mobile Device Users: For users of smartphones and tablets, the view is similar but the interface is different. Interaction is through touchscreens (taps, swipes) and often voice commands (like Siri or Google Assistant). These devices are almost always connected to networks.
No Direct User View: Many computers are designed to run without a user directly interacting with them. These are embedded computers found in devices like smart appliances, car engines, or industrial machines. Their operating systems are built to run a specific set of programs reliably and without user intervention. The "user view" might just be a few status lights.
1.1.2 System View¶
From the computer's own perspective, the operating system is the program that has the most direct control over the hardware. We can define the OS through two key roles:
Resource Allocator: A computer system is a collection of expensive resources: CPU time, memory space, file storage space, and I/O devices. Imagine many different application programs and users all needing these resources at the same time, sometimes in conflicting ways (e.g., two programs wanting to print at once). The operating system is the manager that decides how to allocate these resources to each program. Its goal is to make sure the entire system operates efficiently (no resources are wasted) and fairly (no single program hogs all the resources).
Control Program: This view emphasizes the OS's role in managing and controlling the execution of programs. The operating system acts as a supervisor to prevent errors and stop programs from using the computer improperly. It is especially concerned with controlling I/O devices, which are complex and can easily be used incorrectly, potentially crashing the whole system. This control ensures stability and security.
1.1.3 Defining Operating Systems¶
We've seen that an operating system can be a resource allocator, a control program, and something that looks different to various users. So, what exactly is an operating system? This section explains why it's hard to pin down a single definition and provides the definitions we'll use in this book.
The Challenge of a Single Definition¶
There is no single, perfect definition for an operating system. The main reason is the incredible diversity of computers themselves. OSes exist in everything from simple toasters and cars to powerful servers and spacecraft. This diversity stems from the rapid evolution of computers.
- Historical Context: Computers evolved quickly from single-purpose machines (e.g., for code-breaking or census calculations) to general-purpose mainframes. It was this shift to multifunction systems that made operating systems necessary.
- Moore's Law: The prediction that computing power would double roughly every 18 months led to computers becoming both more powerful and smaller. This explosion in capability and form factors resulted in a huge variety of operating systems, each designed for specific needs.
Since computers are used for so many different things, the OS designed to manage them must also vary. The fundamental goal, however, remains the same: to make the computer system usable. Bare hardware is difficult to program. The OS simplifies this by providing common, reusable functions (like handling I/O devices) so that application developers don't have to reinvent the wheel for every program.
What's Included in an Operating System?¶
There's also no universal agreement on what software components are officially part of the "operating system." Two common views are:
The "Everything Shipped" View: The operating system is everything the software vendor includes in the box when you buy "the operating system." The problem with this definition is that it varies wildly. One OS might be tiny (less than 1 MB) and text-based, while another (like Windows or macOS) is huge (gigabytes) and based on a graphical interface.
The Kernel View (The Core Definition): This is the definition we will usually follow. Here, the operating system is defined as the kernel—the one program that is always running on the computer. The kernel is the core that manages hardware resources and is essential for the system's operation.
However, when we talk about an OS in a broader sense, we often include other programs that come with it. We can categorize all software on a computer into three types:
- Kernel: The core, always-running part of the OS.
- System Programs: Programs that are associated with and support the operating system but are not part of the kernel. Examples include disk formatters, system monitors, and even command-line shells.
- Application Programs: All programs not needed for the system's operation, like word processors, web browsers, and games.
The Modern Complexity: The Case of Mobile OSes¶
The question of "what is part of the OS" became a major legal issue in the 1990s with the United States v. Microsoft case, in which Microsoft was accused of bundling too much functionality (such as a web browser) into its OS to limit competition.
Today, this bundling is common, especially in mobile operating systems like Apple's iOS and Google's Android. These systems include:
- The core kernel.
- Middleware: A set of software frameworks that provide standard services to application developers, such as for databases, multimedia, and graphics. This makes it easier to write apps.
Summary Definition for This Book¶
For our purposes, we will consider the operating system to include:
- The kernel (the essential core).
- Middleware frameworks (common in modern systems).
- System programs (tools for managing the system).
Most of this textbook will focus on the concepts and techniques involved in building the kernel of a general-purpose operating system, as that is where the fundamental challenges of resource management and control are solved.
Why Study Operating Systems?¶
You might wonder why you need to study OSes if you don't plan to build one. The reason is crucial: almost all code runs on top of an operating system.
Understanding how the OS works is essential for:
- Efficient Programming: Knowing how the OS manages memory, the CPU, and I/O devices helps you write programs that use these resources efficiently.
- Effective Problem-Solving: When something goes wrong (e.g., a program runs slowly or crashes), understanding the OS helps you diagnose the problem.
- Secure Programming: Many security vulnerabilities stem from misunderstandings about how the OS manages processes and memory. Understanding the OS is key to writing secure code.
In short, understanding the fundamentals of operating systems is not just for OS developers; it is highly useful for any programmer or computer scientist who writes applications that run on them.
1.2 Computer-System Organization¶
This section dives into the hardware organization of a standard computer system. Understanding this hardware setup is crucial because the operating system is designed directly to manage and interact with these components. Think of this as learning the "playing field" on which the OS operates.
The Basic Computer System Layout¶
A modern general-purpose computer is built around several key components connected by a central highway:
- CPUs (Central Processing Units): These are the brains of the computer, responsible for executing program instructions. Modern systems often have multiple CPUs (or cores).
- Device Controllers: These are specialized processors that manage specific types of I/O devices. Each controller is in charge of a particular device type, like a disk drive, graphics adapter, or USB port. A controller can often manage more than one device (e.g., a USB controller can manage a keyboard, mouse, and printer through a hub).
- Shared Memory (RAM): This is the main memory that both the CPUs and the device controllers can access.
- System Bus: This is the communication pathway that connects the CPUs, memory, and device controllers, allowing them to exchange data and signals.
(Refer to Figure 1.2: A typical PC computer system)
The figure shows how the CPU and various device controllers (for disks, graphics, USB, etc.) are all connected to the shared memory via the common system bus.
How Device Controllers Work: Each device controller has its own small, fast memory called a local buffer and a set of special-purpose registers. The controller's job is to handle the low-level details of communicating with its specific device. For example, a disk controller moves data between the physical disk drive and its local buffer.
The Role of the Operating System: Device Drivers¶
The operating system doesn't talk directly to the complex hardware of each device controller. Instead, for each controller type, the OS has a software component called a device driver. The driver understands the specific details and commands for its controller and provides a simple, standard interface to the rest of the OS. This means the OS core can just say "read a block of data" to any disk driver, without needing to know whether it's a SATA SSD or an NVMe drive.
Parallel Execution and Memory Access: A key point is that the CPU and the device controllers can all run at the same time (in parallel). They are independent units. They often need to access the main memory (RAM) simultaneously—the CPU to fetch instructions and data, and a controller to place data it has just read from a device. To prevent chaos, a memory controller is used to synchronize access to the shared memory, ensuring orderly reads and writes.
We will now explore three fundamental aspects of how this system operates, starting with the crucial concept of interrupts.
1.2.1 Interrupts¶
Interrupts are the fundamental mechanism that allows the CPU to be notified when an event requires its attention. They are essential for efficient I/O handling and prevent the CPU from wasting time waiting for slow devices.
A Typical I/O Operation with Interrupts¶
Let's walk through the example of a program reading a character from the keyboard. This illustrates why interrupts are needed.
- Initiating the I/O: The program requests a read operation. The OS's device driver for the keyboard takes over. It communicates with the keyboard controller by loading commands and parameters into the controller's registers (e.g., "read a character").
- Controller Handles the Details: The keyboard controller then performs the actual work. It waits for a keypress, receives the electrical signal, determines which character was pressed, and transfers that character's data into its local buffer.
- The Waiting Problem: Once the driver has started the controller, what should the CPU do? Without interrupts, the CPU would have to sit in a loop constantly checking (or polling) the controller's status register to see if the data is ready. This is incredibly inefficient and wastes CPU cycles that could be used for other work.
- The Interrupt Solution: Instead of polling, the system uses an interrupt. When the keyboard controller has finished transferring the character to its buffer, it signals the CPU that it has finished by triggering an interrupt.
- CPU Response: The interrupt signal causes the CPU to immediately pause its current work, save its state (so it can resume later), and jump to a special function called an interrupt handler (or interrupt service routine). This handler is part of the device driver.
- Completion: The interrupt handler in the keyboard driver reads the data from the controller's buffer, delivers it to the requesting program, and then informs the OS scheduler that the program waiting for input can now continue. Finally, the CPU restores its saved state and resumes what it was doing before the interrupt occurred.
In summary, the interrupt is the controller's way of saying, "I'm done with the task you gave me." This mechanism allows the CPU to work on other tasks while slow I/O operations are in progress, leading to much higher system efficiency.
1.2.1.1 Interrupts: A Detailed Overview¶
This section provides a deeper look into the hardware mechanism of interrupts, which is a critical concept in computer architecture that enables efficient multitasking and I/O handling.
The Basic Interrupt Mechanism¶
Interrupts are signals sent by hardware devices to the CPU via the system bus (the main communication highway connecting the CPU, memory, and controllers). These signals can arrive at any time, and they are the primary way hardware communicates with the CPU.
The standard sequence of events when an interrupt occurs is as follows:
- CPU is Interrupted: The CPU is executing instructions from a program.
- Transfer to Fixed Location: The interrupt signal causes the CPU to stop its current work immediately. It then transfers execution to a predetermined, fixed memory location.
- Execute Interrupt Service Routine (ISR): This fixed location contains the starting address of an Interrupt Service Routine (ISR), which is a special function designed to handle the specific interrupt. The ISR, which is part of the device driver, executes.
- Resume Computation: Once the ISR has finished its task, the CPU resumes executing the original program from the exact point where it was interrupted, as if nothing had happened.
(Refer to Figure 1.3: Interrupt timeline for a single program doing output)
This figure shows a visual timeline: the program executes, issues an I/O request, and continues executing. Meanwhile, the I/O device is working. When the device finishes, it issues an interrupt. The CPU then switches to the I/O interrupt service routine to handle the completion before returning control to the user program.
The Need for Speed: The Interrupt Vector¶
Interrupts happen very frequently in a running system, so they must be handled as quickly as possible. Each type of device (keyboard, disk, timer) has its own unique ISR. The CPU needs a fast way to find the correct ISR for a given interrupt.
The solution is an interrupt vector. This is a table (an array) stored in a fixed location in low memory (the first hundred or so memory addresses). Each entry in this table is a pointer to the starting address of an ISR for a specific interrupt.
Here's how it works:
- When a device triggers an interrupt, it also sends a unique number, called an interrupt request number (IRQ), along with the signal.
- The CPU uses this IRQ number as an index into the interrupt vector table.
- It looks up the address stored at that index in the table.
- It immediately jumps to that address to execute the correct ISR.
This method is extremely fast because it involves a simple table lookup and jump, with no intermediate steps. This design is common across different operating systems like Windows and UNIX/Linux.
Preserving State: The Importance of Saving and Restoring¶
A crucial requirement for interrupts to work transparently is that the interrupted program must not know it was interrupted. From the program's perspective, its instructions execute sequentially without any breaks. To achieve this, the system must save the state of the interrupted computation.
The state includes all the information needed to resume execution exactly where it left off. This primarily means:
- The program counter (PC), which holds the address of the next instruction to execute.
- The contents of all CPU registers.
This saving of state can happen in two ways:
- Hardware-Automated Saving: The CPU hardware itself may automatically save the program counter and possibly some key registers onto a special stack (the kernel stack) before jumping to the ISR.
- Software Saving within the ISR: The ISR code itself is responsible for saving the state of any registers it plans to use. Before doing any work, the ISR must save the current values of these registers (typically by pushing them onto the stack). Before returning, it must restore these saved values.
After the ISR finishes, the saved return address (the old program counter) is loaded back into the PC, and the restored register values are used. This allows the interrupted computation to continue seamlessly, completely unaware of the interrupt. This careful saving and restoring of state is what makes concurrent execution of multiple processes possible.
1.2.1.2 Interrupt Implementation¶
This section explains the precise hardware and software steps of an interrupt and then discusses the advanced features needed in a modern OS.
The Step-by-Step Interrupt Mechanism¶
The process is a precise sequence of cooperation between hardware and software:
Hardware Detection: The CPU hardware has a special wire called the interrupt-request line. The CPU checks this wire for a signal after executing every single instruction.
Interrupt Signal: When a device controller needs service, it asserts a signal (puts a voltage) on this line. We say the controller raises an interrupt.
CPU Response: The CPU catches the interrupt. It immediately saves the address of the next instruction (the program counter) and then reads an interrupt number provided by the interrupting device.
Handler Dispatch: The CPU uses this interrupt number as an index into the interrupt vector table stored in low memory. It retrieves the address of the corresponding interrupt handler (a routine within the device driver) and jumps to it.
Software Handling (The ISR): The interrupt handler now executes. Its tasks are:
- Save State: It saves the current state (CPU register values) that it will modify, typically by pushing them onto a stack.
- Process Interrupt: It determines the cause (e.g., which device raised the interrupt) and performs the necessary processing (e.g., reading data from the device controller's buffer).
- Restore State: It restores the saved register values by popping them off the stack.
- Return: It executes a special return from interrupt instruction. This instruction restores the CPU to its pre-interrupt state, resuming the interrupted program.
Cycle Completion: The interrupt is considered cleared once the handler has serviced the device.
(Refer to Figure 1.4: Interrupt-driven I/O cycle)
This figure provides a numbered flowchart of the entire process, showing the interaction between the CPU and the I/O controller, from initiating I/O to resuming the interrupted task.
Advanced Interrupt-Handling Features¶
A basic interrupt system is sufficient for simple computers, but modern operating systems require more sophisticated control. Three key needs are:
- Deferring Interrupts: The OS must be able to postpone handling interrupts during critical tasks, such as when it is already processing a crucial OS data structure.
- Efficient Dispatching: The system needs a fast way to find the right handler for a device.
- Prioritization: Not all interrupts are equally urgent. A network packet arriving is time-sensitive, while a printer finishing a job is not. The OS needs a way to prioritize.
These features are provided by the CPU and an additional chip called an interrupt controller.
Interrupt Lines: Maskable vs. Nonmaskable¶
Most CPUs have two distinct interrupt-request lines to address the need for deferment:
- Nonmaskable Interrupt (NMI): This interrupt cannot be disabled (or "masked") by the CPU. It is reserved for critical, unrecoverable hardware errors like a memory parity error or a power failure warning. The OS must respond to it immediately.
- Maskable Interrupt: This is the standard interrupt line used by device controllers (disks, network cards, etc.). The CPU can temporarily disable (mask) these interrupts before executing a critical sequence of instructions that must complete without being interrupted. This ensures the OS can perform essential tasks atomically.
Interrupt Chaining: Handling Many Devices¶
The interrupt vector table has a limited number of entries. But a modern PC has more devices than available vector numbers. A common solution is interrupt chaining.
In this scheme, each entry in the interrupt vector table does not point to a single handler. Instead, it points to the head of a linked list of interrupt handlers. When an interrupt occurs, all the handlers in the corresponding chain are called one by one. Each handler checks if its device was the source of the interrupt. The first handler that recognizes its device services the interrupt.
This is a compromise: it avoids the need for a massive interrupt table while also avoiding the inefficiency of having one giant handler check every possible device.
(Refer to Figure 1.5: Intel processor event-vector table)
This table shows a real-world example. Entries 0-31 are for nonmaskable events like divide-by-zero errors and page faults. Entries 32-255 are for maskable, device-generated interrupts.
Interrupt Priority Levels¶
The interrupt mechanism also supports priority levels. This allows the CPU to defer low-priority interrupts without disabling all interrupts. More importantly, it enables interrupt preemption: a high-priority interrupt can itself interrupt the execution of a low-priority interrupt handler. This ensures the most urgent work is done first.
Summary¶
Interrupts are the fundamental mechanism for handling asynchronous events in a computer system, from I/O completion to hardware errors. Modern systems use a sophisticated architecture involving maskable interrupts, an interrupt vector for fast dispatching, chaining for flexibility, and priority levels to ensure that time-critical tasks receive immediate attention. Efficient interrupt handling is absolutely essential for good system performance.
1.2.2 Storage Structure¶
The Central Role of Memory¶
Think of the CPU as the brain of the computer. This brain can only think about things that are right in front of it. In computer terms, the CPU can only execute instructions that are already loaded into the main memory (also called RAM - Random Access Memory). This is a fast, rewritable, but volatile type of memory. "Volatile" means it's like short-term memory; it forgets everything as soon as the power turns off. Main memory is typically built using DRAM (Dynamic Random-Access Memory) technology.
But if the computer just turned on and RAM is empty, how does it know what to do first? This is where other, more permanent types of memory come in.
Nonvolatile Memory: The Bootstrap and Firmware¶
The very first program that runs when you turn on a computer is the bootstrap program. Since the RAM is empty and volatile, this program can't be stored there. Instead, it's stored in nonvolatile memory, which retains its contents even without power.
One common type is EEPROM (Electrically Erasable Programmable Read-Only Memory), a form of firmware. Think of EEPROM as a permanent notepad. You can write on it and change it if you really need to, but you don't do it often because it's a slow process. It's perfect for storing essential, rarely-changed information like the bootstrap program, a device's serial number, or hardware settings (like on an iPhone).
Storage Definitions and Notation¶
Before we go further, let's define the basic building blocks of storage:
- Bit: The smallest unit, a single binary digit (0 or 1).
- Byte: A group of 8 bits. This is the smallest unit that a computer typically moves around.
- Word: The natural unit of data for a specific computer architecture. It's the size of the processor's registers. A word is made up of one or more bytes. For example, a 64-bit computer has a word size of 64 bits, or 8 bytes. The CPU likes to work with full words whenever possible.
We measure storage in bytes. Because computers use binary math, the units are based on powers of 2 (2^10 = 1024), not powers of 10 (1000). However, manufacturers often use the decimal system for marketing.
- Kilobyte (KB): 1,024 bytes
- Megabyte (MB): 1,024^2 bytes (1,048,576 bytes)
- Gigabyte (GB): 1,024^3 bytes
- Terabyte (TB): 1,024^4 bytes
- Petabyte (PB): 1,024^5 bytes
Important Exception: Networking speeds are measured in bits per second (e.g., Mbps), because data is sent one bit at a time over a network.
How the CPU Interacts with Memory¶
All memory can be thought of as a massive array of bytes, where each byte has a unique address. The CPU interacts with memory using two fundamental instructions:
- Load: Moves a byte or word from main memory into a CPU register.
- Store: Moves the content of a CPU register to a byte or word in main memory.
This happens in the instruction-execution cycle (von Neumann architecture):
- Fetch: The CPU fetches the next instruction from the memory address held in the program counter and loads it into the instruction register.
- Decode: The CPU decodes the instruction.
- Execute: If the instruction needs data from memory, it performs a load to get the operands into registers. The CPU executes the instruction.
- Store Result: The result may be stored back into memory.
From the memory's perspective, it just sees a stream of addresses coming from the CPU. It doesn't know or care if an address is for an instruction or a piece of data.
The Need for Secondary Storage¶
Ideally, we'd keep everything in fast main memory (RAM). But this is impossible for two reasons:
- Size: Main memory is too small to hold all of our programs and data permanently.
- Volatility: RAM loses all data when the power is lost.
To solve this, computers use secondary storage. This is nonvolatile storage that can hold massive amounts of data permanently. The most common types are Hard-Disk Drives (HDDs) and Nonvolatile Memory (NVM) devices (like SSDs). Programs live on secondary storage until they are loaded into RAM to run. The trade-off? Secondary storage is much slower than main memory. Managing this speed difference is a critical job of the operating system (discussed in Chapter 11).
The Memory Hierarchy¶
We don't just have two types of storage. We have a whole pyramid of storage types, called the memory hierarchy. (Refer to Figure 1.6 in your text).
This hierarchy is organized based on a trade-off between speed, size, and cost:
- Rule of Thumb: The smaller and faster the memory, the more expensive it is per byte and the closer it needs to be to the CPU.
- Volatility: The hierarchy is also split between volatile and nonvolatile storage.
Let's walk through the hierarchy from top (fastest, smallest) to bottom (slowest, largest):
| Level | Example | Volatile? | Typical Use |
|---|---|---|---|
| Primary Storage | Registers (inside CPU) | Volatile | Holding the data the CPU is working on right now. |
| Primary Storage | Cache (L1, L2, L3) | Volatile | A buffer between the super-fast CPU and slower RAM. |
| Primary Storage | Main Memory (RAM) | Volatile | Holding programs and data currently in use. |
| Secondary Storage | NVM Devices (e.g., SSDs, Flash) | Nonvolatile | Permanent storage for programs and data; faster than HDDs. |
| Secondary Storage | HDDs (Hard Drives) | Nonvolatile | Permanent storage for programs and data; high capacity, low cost. |
| Tertiary Storage | Magnetic Tapes, Optical Discs | Nonvolatile | Archival and backup; very slow, very high capacity. |
Key Technology Notes:
- The top three levels (Registers, Cache, RAM) are built with semiconductor memory (like DRAM).
- NVM devices (like the flash memory in your phone or an SSD) are becoming extremely common and are faster than traditional hard drives (HDDs).
Operating System Terminology for Storage¶
To keep things clear throughout the book, the text will use specific terms:
- "Memory": This will always refer to volatile storage (RAM). If it means something else (like a register), it will be specified.
- "NVS" (NonVolatile Storage): This is the general term for storage that persists without power. It's divided into two types:
- Mechanical Storage: Devices with moving parts. Examples: HDDs, optical disks, magnetic tapes. Generally, these are larger, slower, and cheaper per byte.
- Electrical Storage / NVM (NonVolatile Memory): Solid-state devices with no moving parts. Examples: Flash memory, SSDs (Solid-State Drives). Generally, these are faster, smaller, and more expensive per byte.
The goal of a good storage system design is to balance all these factors: use fast memory where you need speed, and cheap, spacious storage where you need capacity. Caches are a crucial tool for this, acting as high-speed buffers to smooth over the large speed differences between levels of the hierarchy.
1.2.3 I/O Structure¶
The Importance of I/O Management¶
A huge part of an operating system's job is managing Input/Output (I/O). This is critical for both system reliability and performance. I/O is complex because there are so many different types of devices (keyboards, disks, network cards, etc.), each with their own speeds and ways of communicating.
The Problem with Simple Interrupt-Driven I/O¶
Recall from Section 1.2.1 the basic concept of interrupt-driven I/O: a device sends an interrupt signal to the CPU when it needs attention. This works well for slow devices that handle small amounts of data, like a keyboard where you type one character at a time.
However, this method creates a lot of overhead for bulk data transfers, like reading a large file from a hard drive. Imagine if for every single byte of that file, the hard drive had to interrupt the CPU. The CPU would spend almost all its time just handling interrupts instead of doing useful work, slowing the entire system to a crawl.
The Solution: Direct Memory Access (DMA)¶
To solve this problem, computers use a smarter component called a DMA (Direct Memory Access) controller.
Here’s how DMA works for a large data transfer, like reading a file from a storage device:
Setup: The CPU does some initial work. It tells the DMA controller the following:
- The memory address where the data should be written (or read from).
- The number of bytes to transfer.
- The direction of the transfer (read from device or write to device).
Transfer: Once set up, the DMA controller takes over. It manages the entire data transfer directly between the I/O device and the main memory.
- The CPU is free during this time to execute other tasks.
Completion: After the entire block of data has been transferred, the DMA controller sends a single interrupt to the CPU to say, "The operation is complete."
The key advantage: Instead of one interrupt per byte (which is inefficient), we have one interrupt per large block of data. This dramatically reduces the CPU's overhead and allows for much faster I/O operations.
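The interrupt savings can be made concrete with a toy Python sketch (not real driver code — the block size and file size are illustrative values) that simply counts interrupts under each scheme:

```python
# Sketch: compare interrupt counts for a bulk transfer handled
# byte-by-byte versus via a DMA controller. Values are illustrative.
BLOCK_SIZE = 4096  # bytes per transferred block (assumed size)

def interrupts_per_byte(n_bytes: int) -> int:
    """Naive interrupt-driven I/O: one interrupt per byte transferred."""
    return n_bytes

def interrupts_with_dma(n_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """DMA: one interrupt per completed block, regardless of block size."""
    return -(-n_bytes // block_size)  # ceiling division: one per block

file_size = 1_000_000  # reading a ~1 MB file
print(interrupts_per_byte(file_size))   # -> 1000000
print(interrupts_with_dma(file_size))   # -> 245
```

Four thousand fewer interrupts per block is the whole point: the CPU fields one completion notice instead of thousands of per-byte ones.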
System Architecture: Buses vs. Switches¶
Most standard computers use a bus architecture, where a single shared communication pathway (the bus) connects the CPU, memory, and I/O devices. Devices take turns using the bus. This can create a bottleneck if multiple devices need to communicate at once.
High-end systems (like powerful servers) often use a switch architecture. Think of it like a network switch: the switch allows multiple components to have direct, concurrent conversations with each other. For example, a disk drive can transfer data to memory at the same time as a network card is receiving data. In this kind of system, DMA becomes even more effective because data paths don't have to wait for a shared bus.
Putting It All Together: How a Modern Computer System Works¶
(Refer to Figure 1.7 in your text for a visual representation of these interactions.)
The figure shows the interplay between all components:
- The CPU follows its thread of execution, going through the instruction execution cycle (fetch, decode, execute).
- It accesses instructions and data from memory (often via a cache for speed).
- When an I/O request is made, the DMA controller handles the bulk data movement between the device and memory.
- Once the transfer is complete, an interrupt is sent to the CPU, which then resumes its work.
This coordinated effort, managed by the operating system, allows the computer to perform efficiently, keeping the CPU busy while I/O operations happen in the background.
1.3 Computer-System Architecture¶
Introduction: Categorizing Systems by Processors¶
In the previous section, we looked at the general components of a computer system (CPU, memory, I/O). Now, we'll see how these components can be organized in different ways. The primary way to categorize computer systems is by the number of general-purpose processors they use.
1.3.1 Single-Processor Systems¶
The Traditional Single-Core CPU¶
Traditionally, most computers were single-processor systems. This means they had one general-purpose processor (one CPU) containing a single processing core. The core is the part of the processor that actually executes instructions and has its own set of local registers for storing data. This single main CPU core is what runs the operating system and application processes by executing a general-purpose instruction set.
The Role of Special-Purpose Processors¶
Even in these so-called "single-processor" systems, there are often many other processors! These are special-purpose processors designed for specific tasks. They are not general-purpose CPUs. Examples include:
- Disk controller microprocessors
- Keyboard controller microprocessors
- Graphics controller (GPU) processors
These special-purpose processors have two key characteristics:
- They run a very limited, specialized instruction set.
- They do not run user processes like a web browser or a word processor.
How the OS Manages Special-Purpose Processors¶
The operating system's relationship with these processors varies:
Managed by the OS: In many cases, the OS directly manages them. The OS sends them commands and monitors their status.
- Example - Disk Controller: The main CPU tells the disk controller microprocessor to read a certain block of data. The disk controller then handles the complex, low-level task of moving the read head, reading the data, and managing its own queue of requests. This offloads work from the main CPU, freeing it up for other tasks.
Autonomous Hardware Components: In other cases, these processors are low-level components that work completely independently. The operating system does not communicate with them directly.
- Example - Keyboard Controller: A microprocessor in your keyboard constantly scans the key matrix. When you press a key, it autonomously converts the physical keypress into a scan code and sends it to the main CPU. The OS doesn't tell it how to do this; it just receives the result.
The Key Definition of a Single-Processor System¶
The critical point is this: The presence of these special-purpose microprocessors does NOT make a system a multiprocessor system.
The definition is strict: If a system has only one general-purpose CPU with a single processing core, it is a single-processor system, regardless of how many other special-purpose chips it has.
Because of this definition, very few modern computers are truly single-processor systems. Almost all contemporary devices—from smartphones to laptops—use CPUs with multiple cores, which we will discuss next. This section describes an older, but foundational, architectural model.
1.3.2 Multiprocessor Systems¶
Introduction: The Dominance of Multiprocessing¶
On virtually all modern computers, from smartphones to servers, multiprocessor systems are the standard. These systems have multiple processing units that work together. The main goal is to increase throughput—getting more work done in less time. While adding a second processor doesn't double the speed (due to coordination overhead), it significantly boosts performance.
Symmetric Multiprocessing (SMP)¶
The most common type of multiprocessor system uses Symmetric Multiprocessing (SMP). In an SMP system:
- There are two or more identical, peer CPUs.
- Each CPU is independent and has its own set of registers and a private cache (often called an L1 cache).
- All CPUs share the same physical memory and I/O devices, connected via a common system bus.
(Refer to Figure 1.8 in your text for a visual of this architecture).
How it works: In SMP, every processor can perform any task, whether it's running the operating system kernel or a user application. This allows N processes to run truly in parallel if there are N CPUs.
The Challenge: Because the CPUs are separate, the system can become unbalanced. One CPU might be idle while another is overloaded. To prevent this, the operating system must use shared data structures to distribute the workload dynamically among all processors. This requires careful programming to avoid conflicts, a topic covered in Chapters 5 and 6.
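The shared-data-structure idea can be sketched in a few lines of Python. Here, worker threads stand in for CPUs and all pull work from one shared ready queue; the queue's internal lock (and an explicit lock around the results list) models the "careful programming to avoid conflicts" mentioned above. This is a conceptual toy, not how a kernel scheduler is actually written:

```python
import queue
import threading

# Toy SMP model: 4 worker threads ("CPUs") drain one shared ready queue,
# so no CPU sits idle while work remains -- the queue balances the load.
ready_queue: "queue.Queue[int]" = queue.Queue()
results = []
results_lock = threading.Lock()  # protects the shared results list

def cpu_worker():
    while True:
        try:
            task = ready_queue.get_nowait()  # grab the next ready "process"
        except queue.Empty:
            return  # nothing left to run; this CPU goes idle
        with results_lock:
            results.append(task * task)  # stand-in for doing real work

for t in range(8):          # 8 ready "processes"
    ready_queue.put(t)

cpus = [threading.Thread(target=cpu_worker) for _ in range(4)]
for c in cpus:
    c.start()
for c in cpus:
    c.join()

print(sorted(results))  # -> [0, 1, 4, 9, 16, 25, 36, 49]
```

The output order from the queue is nondeterministic (whichever CPU is free takes the next task), which is exactly why the shared structures must be locked.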
The Evolution: Multicore Systems¶
The definition of a multiprocessor has evolved. Instead of having multiple separate processor chips, we now mostly have multicore systems, where multiple computing cores reside on a single physical chip.
- Core: The basic computation unit of a CPU.
- Multicore: A single processor chip that contains multiple cores.
(Refer to Figure 1.9 in your text for a dual-core design).
Advantages of Multicore:
- Efficiency: Communication between cores on the same chip is much faster than communication between separate chips.
- Power Savings: One chip with multiple cores uses significantly less power than multiple single-core chips. This is crucial for mobile devices and laptops.
Typical Multicore Design: Each core has its own private L1 cache. They often share a larger L2 cache on the same chip. This combines the speed of private caches with the capacity of a shared cache.
From the operating system's perspective, a multicore processor with N cores looks exactly like N standard CPUs. This places a major responsibility on the OS (and application programmers) to efficiently schedule tasks across all available cores. All modern operating systems (Windows, macOS, Linux, Android, iOS) support SMP on multicore systems.
Definitions of Computer System Components¶
To avoid confusion, here are the precise definitions the text will use:
- CPU (Central Processing Unit): The hardware that executes instructions. We use this as a general term for a single computational unit.
- Processor: A physical chip that contains one or more CPUs.
- Core: The basic computation unit of the CPU. A single computing engine.
- Multicore: Including multiple computing cores on the same CPU chip.
- Multiprocessor: A system that includes multiple processors.
Important Note: Since almost all systems are now multicore, we will use "CPU" loosely to mean a computational unit. We will use "core" and "multicore" when specifically referring to the architecture of a single chip.
Scaling Beyond the Bus: Non-Uniform Memory Access (NUMA)¶
There's a limit to how well SMP systems can scale. As you add more and more CPUs, they all compete for access to the shared memory over the single system bus, which becomes a bottleneck.
NUMA (Non-Uniform Memory Access) is an advanced architecture designed to solve this scaling problem.
- How it works: In a NUMA system, each CPU (or group of CPUs) has its own local memory. The CPUs are connected by a high-speed system interconnect.
- The "Non-Uniform" Part: The key characteristic is that access time to memory is not uniform.
- Accessing local memory (the memory attached to your own CPU) is very fast.
- Accessing remote memory (memory attached to another CPU) is slower because it has to travel across the interconnect.
(Refer to Figure 1.10 in your text for a visual of the NUMA architecture).
Advantage: NUMA systems can scale to a much larger number of processors because there is less contention for a single memory bus. Disadvantage: Performance can suffer if a process running on one CPU needs to frequently access data in another CPU's memory. The OS must be smart about CPU scheduling and memory management (discussed in Section 5.5.2 and Section 10.5.4) to keep a process and its memory on the same node as much as possible. NUMA is very popular in high-end servers.
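The "non-uniform" cost model is easy to state as arithmetic. The sketch below uses made-up latency numbers (80 ns local, 200 ns remote) purely for illustration; real ratios depend on the interconnect:

```python
# Toy NUMA cost model: the same access is cheap on local memory and
# expensive on remote memory. Latencies are invented for illustration.
LOCAL_NS, REMOTE_NS = 80, 200  # hypothetical nanoseconds per access

def access_cost(cpu_node: int, memory_node: int, n_accesses: int) -> int:
    """Total cost: per-access latency depends only on whether the memory
    is attached to the accessing CPU's node (local) or another (remote)."""
    per_access = LOCAL_NS if cpu_node == memory_node else REMOTE_NS
    return per_access * n_accesses

# A process on node 0 touching its own memory vs. node 1's memory:
print(access_cost(0, 0, 1000))  # -> 80000   (fast local path)
print(access_cost(0, 1, 1000))  # -> 200000  (crosses the interconnect)
```

This is why the OS tries to schedule a process on the node whose memory it uses: with these assumed numbers, a remote-heavy workload pays 2.5x per access.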
Blade Servers¶
Finally, blade servers represent another multiprocessor design. In a blade server chassis, multiple independent processor boards (blades) are stacked together. The key difference is that:
- Each blade boots independently and runs its own instance of an operating system.
- Some blades can themselves be multiprocessor systems.
This blurs the line between a single computer and a cluster of computers. Essentially, a blade server is a collection of multiple independent multiprocessor systems sharing a single chassis for power and networking.
1.3.3 Clustered Systems¶
What is a Clustered System?¶
A clustered system is another form of a multiprocessor system, but it's structured differently from the tightly-coupled SMP and NUMA systems we just discussed. Instead of multiple CPUs sharing memory inside one computer, a cluster is made up of two or more individual systems (called nodes) that are linked together. Each node is typically a complete, independent computer, often a multicore system itself. These systems are considered loosely coupled.
The exact definition of a cluster can be fuzzy, but the generally accepted one is that clustered computers share storage and are closely linked via a network like a LAN (Local-Area Network) or a very fast interconnect like InfiniBand.
(Refer to Figure 1.11 in your text for the general structure of a clustered system).
Primary Goal: High-Availability¶
The most common reason for building a cluster is to provide high-availability service. This means the service provided by the cluster will continue to operate even if one or more of its nodes fails.
How does it achieve this? Through redundancy.
- Special cluster software runs on each node.
- The nodes constantly monitor each other over the network (a "heartbeat").
- If a node fails, a monitoring node can take over its work: it takes ownership of the failed node's storage and restarts its applications.
- From a user's perspective, this results in only a brief interruption of service.
This ability to continue service is a form of graceful degradation. Some highly robust clusters are fault tolerant, meaning they can survive the failure of any single component without any interruption in service. Fault tolerance requires sophisticated mechanisms to detect, diagnose, and correct failures automatically.
Types of Clustering: Asymmetric vs. Symmetric¶
Asymmetric Clustering (Active-Passive):
- One machine (the active server) runs the applications.
- The other machine is in hot-standby mode. It does nothing but monitor the active server.
- If the active server fails, the hot-standby host becomes active.
- This is simple but inefficient because the standby hardware is idle until a failure occurs.
Symmetric Clustering (Active-Active):
- Two or more hosts are both running applications and monitoring each other.
- This is more efficient because it uses all of the available hardware.
- It requires that there are multiple applications to run, so if one node fails, its workload can be distributed across the remaining active nodes.
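The heartbeat-and-failover logic of asymmetric clustering can be sketched as a small state machine. The timeout value and class names here are invented for illustration, not taken from any real cluster product:

```python
import time

# Sketch of asymmetric (active-passive) clustering: the standby node
# watches heartbeats from the active node and takes over when they stop.
HEARTBEAT_TIMEOUT = 3.0  # seconds without a heartbeat => declare failure

class StandbyNode:
    def __init__(self):
        self.last_heartbeat = time.monotonic()
        self.active = False  # hot standby: monitoring only, until failover

    def receive_heartbeat(self):
        self.last_heartbeat = time.monotonic()

    def check(self, now=None):
        """Called periodically; promotes this node if the active one died."""
        now = time.monotonic() if now is None else now
        if not self.active and now - self.last_heartbeat > HEARTBEAT_TIMEOUT:
            # Failover: take ownership of storage, restart applications.
            self.active = True
        return self.active

standby = StandbyNode()
standby.receive_heartbeat()
print(standby.check(standby.last_heartbeat + 1.0))  # -> False (active node healthy)
print(standby.check(standby.last_heartbeat + 5.0))  # -> True  (failover triggered)
```

Symmetric clustering uses the same monitoring, but every node both runs applications and watches its peers.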
Secondary Goal: High-Performance Computing¶
Because a cluster is a group of computers connected by a network, it can also be used to tackle massive computational problems. The combined power of all the nodes can far exceed that of a single-processor or even a large SMP system.
To use a cluster this way, the application must be specially designed using a technique called parallelization. This means the program is split into separate components that can run simultaneously on different nodes in the cluster. Each node works on its part of the problem, and the results are combined at the end for a final solution.
Parallel Clusters and Shared Data¶
A common type of high-performance cluster is a parallel cluster, where multiple hosts need to access the same data on shared storage. This is complex because most standard operating systems aren't designed for multiple computers to simultaneously read from and write to the same disk.
To make this work, we need:
- Special Software: Special versions of applications and operating systems are required. For example, Oracle Real Application Clusters (RAC) is a database version designed for parallel clusters.
- Distributed Lock Manager (DLM): This is a crucial software component that controls access to the shared data. It provides access control and locking to prevent conflicts when multiple nodes try to modify the same data at the same time. The DLM ensures data remains consistent.
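The access-control rule a DLM enforces can be shown with a minimal sketch. A real DLM is itself distributed and fault tolerant; this toy version is centralized and exists only to show the rule that a node must hold a resource's lock before modifying it (all names here are invented):

```python
# Toy lock manager: one lock per shared resource, granted to one node
# at a time. Real DLMs add queuing, lock modes, and failure recovery.
class ToyLockManager:
    def __init__(self):
        self.owners = {}  # resource name -> node currently holding its lock

    def acquire(self, resource: str, node: str) -> bool:
        holder = self.owners.get(resource)
        if holder is None or holder == node:
            self.owners[resource] = node
            return True
        return False  # another node holds it; caller must wait and retry

    def release(self, resource: str, node: str):
        if self.owners.get(resource) == node:
            del self.owners[resource]

dlm = ToyLockManager()
print(dlm.acquire("block-42", "node-a"))  # -> True
print(dlm.acquire("block-42", "node-b"))  # -> False (node-a holds the lock)
dlm.release("block-42", "node-a")
print(dlm.acquire("block-42", "node-b"))  # -> True
```

Because node-b's write is refused until node-a releases the lock, the two nodes can never corrupt the same disk block concurrently, which is the consistency guarantee the text describes.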
The Role of Storage-Area Networks (SANs)¶
Cluster technology is rapidly evolving, enabled in large part by Storage-Area Networks (SANs) (covered in Section 11.7.4). A SAN is a dedicated, high-speed network that provides multiple servers with access to a shared pool of storage devices.
How SANs help clustering:
- The applications and data reside on the SAN, not on any individual node.
- Any node in the cluster can run an application because they all have equal access to the data on the SAN.
- This makes failover seamless. If a host fails, any other host can immediately take over, as it already has access to the necessary data and applications.
- This allows for very large-scale database clusters where dozens of hosts can work on the same database, boosting both performance and reliability.
PC Motherboard (Sidebar)¶
The text includes a sidebar on a PC motherboard to connect these abstract concepts to physical hardware.
- A desktop PC motherboard with a processor socket, DRAM slots, and I/O connectors is a fully functioning computer once assembled.
- Even low-cost CPUs today contain multiple cores.
- Some motherboards have multiple processor sockets, creating an SMP system.
- More advanced systems with multiple system boards can create NUMA systems.
This highlights that the architectures discussed (single-core, multicore, SMP, NUMA) are all built upon the same fundamental physical components.
1.4 Operating-System Operations¶
Introduction: The OS as an Execution Environment¶
We've covered the computer's hardware; now let's talk about the software that brings it to life: the operating system. The OS creates the environment where programs can run. While different OSes are built in different ways, they share common fundamental operations.
Booting the System: The Bootstrap Program¶
For a computer to start—when you power it on or reboot it—it needs an initial program to run. This is the bootstrap program (often called "the BIOS" or "UEFI firmware" in PCs).
- Where is it stored? It's stored in firmware (like EEPROM) on the computer's hardware, so it's available immediately when power is applied.
- What does it do? It's a simple program that performs the initial "wake-up" sequence:
- It initializes all hardware components: CPU registers, device controllers, and memory.
- Its most important job is to locate the operating system kernel on a storage device (like a hard drive or SSD) and load it into memory.
- Once the kernel is loaded into memory and the CPU starts executing it, the OS takes over.
System Startup: The Kernel and Daemons¶
After the bootstrap program loads the OS kernel, the kernel begins providing services. However, not all system services run inside the kernel itself.
- System Daemons: Many services are started by system programs that are loaded at boot time. These programs run in the background as long as the system is on and are called daemons (in UNIX/Linux) or services (in Windows).
- Example - systemd: On modern Linux systems, the first program started after the kernel is typically systemd. Its job is to start all the other necessary daemons, like a network manager, a scheduler, and a logging service.
- Once all these daemons are running, the system is considered fully booted and waits for events.
Event-Driven Execution: The Role of Interrupts¶
When the system is fully booted, what does the OS do? If there's nothing to run, no I/O to handle, and no user input, the OS simply waits. It is an event-driven system. Almost all events are signaled by an interrupt.
We learned about hardware interrupts in Section 1.2.1 (e.g., a disk controller signaling that a data transfer is complete). Now we introduce a second, crucial type:
Trap (or Exception): A software-generated interrupt. There are two main causes of a trap:
- An Error: For example, if a program tries to divide by zero or access memory it doesn't have permission for, the CPU generates a trap. The OS then handles this error, often by terminating the offending program.
- A Service Request: This is the deliberate way a user program asks the OS to do something on its behalf. A program performs this request by executing a special instruction called a system call (e.g., to read a file, send data over a network, or create a new process). Executing a system call triggers a trap, which switches the CPU from user mode to kernel mode, allowing the OS to safely execute the requested service.
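The event-driven model above amounts to a table lookup: every event that reaches the kernel names a cause, and the cause selects a handler. The sketch below illustrates the idea with invented handler names and event labels (they do not correspond to any real kernel's naming):

```python
# Sketch of event-driven dispatch: hardware interrupts and software
# traps both enter the kernel, which routes each cause to a handler.
def handle_divide_by_zero(ctx): return f"terminate pid {ctx['pid']}"
def handle_syscall(ctx):        return f"service syscall {ctx['number']}"
def handle_io_complete(ctx):    return f"wake pid {ctx['pid']}"

VECTOR = {  # cause -> kernel handler (akin to an interrupt vector)
    "trap:div0":    handle_divide_by_zero,  # error trap
    "trap:syscall": handle_syscall,         # deliberate service request
    "irq:disk":     handle_io_complete,     # hardware interrupt
}

def kernel_entry(event: str, ctx: dict) -> str:
    """All paths into the kernel funnel through one dispatch point."""
    return VECTOR[event](ctx)

print(kernel_entry("trap:syscall", {"number": 3}))  # -> service syscall 3
print(kernel_entry("trap:div0", {"pid": 17}))       # -> terminate pid 17
```

Note the asymmetry the text describes: an error trap usually ends the offending process, while a system-call trap performs work on its behalf.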
Hadoop (Sidebar)¶
Hadoop is a practical, real-world example of software designed for the clustered systems we just discussed. It's an open-source framework for processing massive data sets ("big data") across a cluster of inexpensive computers.
Key Characteristics:
- Designed for Clusters: It scales from one machine to thousands.
- Manages Parallelism: It assigns tasks to nodes and manages communication between them to process data in parallel.
- Provides Reliability: It automatically detects and handles node failures, making the entire cluster highly reliable.
Hadoop is organized into three core components:
- Distributed File System (HDFS): Manages files and data spread across all the nodes in the cluster.
- YARN ("Yet Another Resource Negotiator"): Acts as the cluster's operating system. It manages resources (CPU, memory) and schedules tasks on the nodes.
- MapReduce: A programming model that allows problems to be broken down into parts that can be processed in parallel on different nodes. The "Map" step processes the data, and the "Reduce" step combines the results into a final answer.
Hadoop typically runs on Linux, and applications can be written in various languages, with Java being a popular choice due to extensive libraries.
1.4.1 Multiprogramming and Multitasking¶
The Need for Running Multiple Programs¶
A fundamental goal of operating systems is to maximize the use of the CPU. A single program can rarely keep the CPU or I/O devices busy 100% of the time. Furthermore, users want the ability to run more than one program at once. Multiprogramming solves this by ensuring the CPU always has a program to execute, thereby increasing CPU utilization and keeping the user productive.
In a multiprogrammed system, a program that is loaded into memory and executing is called a process.
How Multiprogramming Works¶
The core idea is straightforward:
- The operating system keeps several processes in memory at the same time. (Refer to Figure 1.12 for a visual of the memory layout).
- The OS begins executing one process.
- Eventually, that process will have to wait for something, like an I/O operation (e.g., reading a file from a slow disk).
- Instead of letting the CPU sit idle, the OS switches to and executes another process that is ready to run.
- When that process needs to wait, the CPU switches to yet another process.
- This continues, and when the first process finishes waiting, it gets the CPU back.
As long as there is at least one process that can execute, the CPU is never idle.
Analogy: Think of a lawyer working on multiple cases. While one case is waiting for a court date or for documents to be prepared, the lawyer works on another case. If the lawyer has enough clients, she is never idle.
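The switch-on-wait loop can be sketched as a tiny simulation. This is deliberately simplified — it ignores how long I/O actually takes and uses invented burst lengths — but it shows the core rule: when the running process blocks, the CPU immediately picks another ready process:

```python
# Toy multiprogramming simulation. Each process is a list of alternating
# CPU bursts and I/O waits (in ms, values invented). The CPU runs one
# burst, then switches away while that process "waits" for I/O.
def run_multiprogrammed(processes):
    """processes: {name: [cpu_burst, io_wait, cpu_burst, ...]}.
    Consumes the burst lists; returns the order of executed CPU bursts."""
    schedule = []
    ready = list(processes)              # initial ready queue
    while ready:
        name = ready.pop(0)              # dispatch the next ready process
        bursts = processes[name]
        schedule.append((name, bursts.pop(0)))  # run its CPU burst
        if bursts:
            bursts.pop(0)                # it now blocks for I/O...
            ready.append(name)           # ...and rejoins the queue afterwards
    return schedule

procs = {"A": [5, 10, 5], "B": [3, 8, 3]}
print(run_multiprogrammed(procs))
# -> [('A', 5), ('B', 3), ('A', 5), ('B', 3)]
```

While A waits out its 10 ms of I/O, the CPU runs B's burst instead of idling — that interleaving is the entire payoff of multiprogramming.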
From Multiprogramming to Multitasking¶
Multitasking (or time-sharing) is a direct extension of multiprogramming. The key difference is the frequency of switching and the primary goal:
- In multiprogramming, the goal is to maximize CPU utilization (a system-oriented goal).
- In multitasking, the goal is to provide a fast response time to the user (a user-oriented goal).
Why is frequent switching needed? Interactive processes (like a text editor or a web browser) spend a lot of time waiting for user input. User input is incredibly slow from a computer's perspective (e.g., typing at 7 characters per second). Instead of letting the CPU idle during this wait, a multitasking OS rapidly switches the CPU to another process. This happens so quickly that it gives the illusion that all programs are running simultaneously, providing a responsive user experience.
The OS Mechanisms Required for Multiprogramming/Multitasking¶
Running multiple processes concurrently is complex and requires the OS to provide several key features:
Memory Management (Chapters 9 & 10): Having multiple processes in memory at once requires the OS to allocate memory to each process, protect each process's memory from others, and manage the movement of processes between memory and disk. This leads to the concept of virtual memory, which allows a program to run even if it's not entirely in physical RAM, making the system more flexible.
CPU Scheduling (Chapter 5): When more than one process is ready to run, the OS must decide which one to run next. The algorithm that makes this decision is called the scheduler.
Protection (Chapter 17): The OS must ensure that processes cannot interfere with each other or with the OS itself. This involves protecting resources like memory, the CPU, and I/O devices.
Process Synchronization and Communication (Chapters 6 & 7): When processes need to interact (e.g., to share data), the OS must provide mechanisms to allow them to coordinate their activities safely to avoid corrupting data.
Deadlock Handling (Chapter 8): The OS must manage the system to prevent or resolve deadlocks, a situation where two or more processes are stuck forever, each waiting for a resource held by the other.
File Systems and Storage Management (Chapters 11, 13, 14, 15): Programs need to store and retrieve data permanently. The OS provides a file system on secondary storage (like hard drives) to manage this data in a structured way.
In summary, the simple goal of "running multiple programs at once" forces the operating system to become a sophisticated manager of all the system's resources. The rest of the textbook essentially details how the OS implements each of these required mechanisms.
1.4.2 Dual-Mode and Multimode Operation¶
The Need for Protection¶
The operating system and all user applications share the same hardware. A critical job of the OS is to ensure that a faulty or malicious user program cannot disrupt the operation of other programs or the OS itself. To achieve this, the system must have a clear way to distinguish between code that is part of the trusted operating system and code that belongs to a regular user application. This is accomplished through hardware-supported modes of execution.
Dual-Mode Operation¶
The fundamental design is dual-mode operation, which provides two separate modes:
- Kernel Mode (Privileged Mode): Also known as supervisor mode or system mode. The OS runs in this mode. Code executing in kernel mode has complete, unrestricted access to all hardware and can execute every instruction in the CPU's instruction set.
- User Mode: User applications run in this mode. Code executing in user mode has restricted access. It cannot directly perform operations that affect the system's overall state.
A hardware bit, called the mode bit, is used to indicate the current mode. It is typically set to 0 for kernel mode and 1 for user mode.
How the Transition Works¶
(Refer to Figure 1.13 for a visual diagram of this process).
- Boot Time: The hardware starts in kernel mode. The OS loads and then starts the first user application, switching the mode bit to user mode.
- User Process Execution: The user process runs in user mode.
- System Call Request: When a user process needs an OS service (like reading a file), it executes a system call. This is a special instruction that triggers a trap (a software interrupt).
- Switch to Kernel Mode: The hardware handles the trap by:
- Switching the mode bit from 1 (user) to 0 (kernel).
- Saving the current state of the user process.
- Transferring control to a predefined interrupt vector location, which points to the appropriate OS service routine (the system-call handler).
- Execute in Kernel Mode: The OS code now executes in kernel mode, with full privileges, to perform the requested service (e.g., accessing the disk controller).
- Return to User Mode: Once the service is complete, the OS executes a special instruction that:
- Switches the mode bit back to 1 (user).
- Restores the saved state of the user process.
- Returns control to the instruction immediately after the system call in the user program.
This transition also happens on hardware interrupts (like a timer interrupt or I/O completion) and other traps (like division by zero errors).
Privileged Instructions: The Enforcer of Protection¶
The mechanism that enforces this protection is the concept of privileged instructions. These are machine instructions that have the potential to cause harm (e.g., directly controlling I/O devices, managing memory, halting the CPU) and are designed to execute only in kernel mode.
- If a user program attempts to execute a privileged instruction, the hardware does not execute it. Instead, it generates a trap, handing control to the OS. The OS then typically terminates the offending program for violating the rules.
- The instruction to switch to kernel mode is itself a privileged instruction.
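The enforcement rule is a simple check the hardware performs before executing each instruction. The sketch below models it in Python with invented instruction names; real hardware does this check in silicon, not software:

```python
# Sketch of dual-mode enforcement: before a privileged instruction runs,
# the "hardware" consults the mode bit and traps instead of executing.
KERNEL, USER = 0, 1                         # mode-bit values, as in the text
PRIVILEGED = {"set_timer", "io_out", "halt"}  # illustrative instruction names

def execute(instr: str, mode: int) -> str:
    if instr in PRIVILEGED and mode == USER:
        # Hardware refuses and traps; the OS typically kills the process.
        return "trap -> OS (illegal instruction; process terminated)"
    return f"executed {instr}"

print(execute("add", USER))          # -> executed add
print(execute("set_timer", USER))    # -> trap -> OS (illegal instruction; process terminated)
print(execute("set_timer", KERNEL))  # -> executed set_timer
```

An ordinary instruction runs in either mode; the privileged one runs only in kernel mode, and a user-mode attempt becomes a trap rather than an execution.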
Beyond Two Modes: Multimode Operation¶
While dual-mode is the foundation, some systems use more than two modes for finer-grained control:
- Intel x86 CPUs have four privilege rings (0 to 3). Ring 0 is kernel mode, and Ring 3 is user mode. Rings 1 and 2 are rarely used in practice.
- ARMv8 CPUs have seven different modes.
- Virtualization Support: CPUs that support hardware virtualization (Section 18.1) often have a separate mode for the Virtual Machine Manager (VMM). The VMM runs with more privilege than a user process but less than the full kernel, allowing it to create and manage virtual machines safely.
The Complete Life Cycle of Instruction Execution¶
We can now describe the complete cycle:
- Control starts with the OS in kernel mode.
- Control is passed to a user application, and the mode is set to user mode.
- Control is returned to the OS via an interrupt, trap, or system call, switching the mode back to kernel mode.
- The OS handles the event and may return control to the same user process or a different one, switching back to user mode.
Handling Errors¶
This hardware protection also catches program errors. If a user program tries to execute an illegal instruction or access memory outside its allocated space, the hardware traps to the OS. The OS then handles this error, which usually means terminating the program abnormally. It may produce an error message and a memory dump (a file containing the program's memory state at the time of the crash) to aid in debugging.
In summary, dual-mode operation is the fundamental hardware mechanism that allows the operating system to maintain ultimate control over the computer, protecting itself and user programs from each other.
1.4.3 Timer¶
The Problem: Maintaining CPU Control¶
Dual-mode operation protects the OS from a program that tries to execute a bad instruction, but what protects the system from a program that simply never gives up the CPU? A user program could accidentally get stuck in an infinite loop or deliberately refuse to call system services, effectively freezing the machine and preventing the OS or any other program from running.
To prevent a single program from monopolizing the CPU, the operating system uses a timer.
How the Timer Works¶
A timer is a hardware device that interrupts the CPU after a specified period of time. The operating system uses it like an alarm clock to regain control.
There are two general types:
- Fixed-rate timer: Interrupts the CPU at a constant frequency (e.g., every 1/60th of a second).
- Variable timer: Can be set to interrupt after a specific, variable time interval.
Most modern systems use a variable timer, which is typically implemented with two components:
- A fixed-rate clock that ticks at a constant frequency (e.g., every 1 millisecond).
- A counter register that the operating system can set.
Here's the process:
- The OS decides how much time to give a program (its time quantum or time slice), say 100 milliseconds.
- Before switching to user mode and starting the program, the OS loads the value 100 into the counter.
- The program runs in user mode.
- With every tick of the clock (every 1 ms), the hardware automatically decrements the counter by one. The program is unaware this is happening.
- When the counter reaches 0, the timer hardware triggers an interrupt.
- This interrupt forces the CPU to switch from user mode to kernel mode, just like a system call, and transfers control back to the operating system's scheduler.
The OS can then decide what to do next: it might give the same program another time slice, or it might switch to a different program that is waiting to run. This mechanism is the foundation of CPU scheduling and multitasking.
Example: If the system has a 10-bit counter and a 1-millisecond clock, the counter can hold values from 0 to 1023, so the OS can set the timer to interrupt after any interval from 1 ms up to 1023 ms, in steps of 1 ms.
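The clock-plus-counter mechanism described above can be sketched directly (the tick loop stands in for the hardware decrementing the counter; the bit width matches the example):

```python
# Sketch of the variable timer: a fixed-rate clock decrements a counter
# the OS loaded; when it reaches zero, the timer interrupt fires.
def ticks_until_interrupt(counter: int, counter_bits: int = 10) -> int:
    """How many 1 ms clock ticks elapse before the interrupt fires,
    for the counter value the OS loaded before dispatching a process."""
    assert 0 < counter < 2 ** counter_bits, "value must fit the counter"
    ticks = 0
    while counter > 0:   # each clock tick decrements the counter by one
        counter -= 1
        ticks += 1
    return ticks         # the interrupt fires on this tick

# A 100 ms time slice: load 100, get interrupted 100 ticks later.
print(ticks_until_interrupt(100))  # -> 100
```

Loading a larger value buys the process a longer slice, which is exactly why (as noted below) the load instruction must be privileged.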
Timer Implementation in Linux¶
The text provides a real-world example with the Linux kernel:
- HZ: A kernel configuration value that defines the frequency of the timer interrupts. For example, if HZ = 250, the timer interrupts the CPU 250 times per second, meaning an interrupt occurs every 4 milliseconds. This value can vary based on the system.
- jiffies: A kernel variable that counts the total number of timer interrupts that have occurred since the system was booted. If HZ is 250, then jiffies increases by 250 every second. It's like a system-wide tick counter.
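The HZ/jiffies arithmetic reduces to two divisions, sketched here as plain helper functions (these are illustrative conversions, not actual kernel APIs):

```python
# Sketch of the HZ/jiffies arithmetic: with HZ timer interrupts per
# second, the interrupt interval and system uptime follow directly.
def interrupt_interval_ms(hz: int) -> float:
    """Milliseconds between consecutive timer interrupts."""
    return 1000 / hz

def uptime_seconds(jiffies: int, hz: int) -> float:
    """Seconds since boot, given the tick count and tick frequency."""
    return jiffies / hz  # each jiffy is one timer interrupt

print(interrupt_interval_ms(250))  # -> 4.0
print(uptime_seconds(15000, 250))  # -> 60.0
```

So with HZ = 250, a jiffies value of 15,000 means the system has been up for one minute.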
Privileged Instruction¶
Crucially, the instructions that load a new value into the timer counter are privileged instructions. A user program cannot set or modify the timer. If it could, a malicious program could set the timer to a huge value (like 100 years) and effectively disable the OS's ability to regain control, taking over the machine. Therefore, only the OS, running in kernel mode, is allowed to set the timer.
1.5 Resource Management¶
An operating system is fundamentally a resource manager. Its primary job is to manage the various hardware and software resources of a computer system efficiently and fairly. The key resources it manages are:
- The CPU (or CPUs)
- Memory (RAM)
- File-storage space (on disks)
- I/O devices (like keyboards, mice, and network cards)
This section introduces how the OS manages these resources, starting with the most central concept: the process.
1.5.1 Process Management¶
What is a Process?¶
A program sitting on your disk (like chrome.exe or gcc) is just a file—a passive set of instructions. A process is a program in execution. It is an active entity.
- Analogy: A program is a recipe (a set of instructions). A process is the activity of a chef actually following that recipe, using ingredients (resources) like a bowl, a mixer, and an oven.
- Examples: When you double-click a web browser icon, you start a process. Every running application on your PC or phone is one or more processes.
A process is more than just the program code (which is called the text section). It also includes the current state of the activity: the values in the CPU registers, the program counter (which points to the next instruction to execute), and the contents of the stack and memory.
Process vs. Program¶
This is a critical distinction:
| Program | Process |
|---|---|
| Passive entity (a file on disk) | Active entity (an executing instance) |
| Static | Dynamic - its state changes as it runs |
| Exists once | Can have multiple instances (e.g., three separate processes for three separate browser windows) |
Even if two processes are running the exact same program (like two users both using the same text editor), they are considered separate execution sequences. Each has its own memory space, its own program counter, and its own set of resources.
Process Resources and Lifecycle¶
A process needs resources to accomplish its task:
- CPU time: To execute its instructions.
- Memory: To hold its code and data.
- Files: To read or write data.
- I/O devices: To interact with the world.
These resources are allocated to the process by the OS when it is started. The process may also be given input data. For example, a process starting a web browser is given a URL as input. When the process finishes its task and terminates, the OS reclaims all the resources so they can be used by other processes.
Single-Threaded vs. Multithreaded Processes¶
- A single-threaded process has a single program counter. This counter keeps track of the next instruction to execute. The execution is strictly sequential—one instruction after another.
- A multithreaded process has multiple program counters, one for each "thread of execution" within the process. Threads allow a single process to perform multiple tasks concurrently (e.g., a web browser downloading a file in one thread while displaying a page in another). We will cover threads in detail in Chapter 4.
The Operating System's Responsibilities in Process Management¶
The OS is responsible for all activities related to process management. Its key duties are:
- Process Creation and Deletion: The OS must be able to create new processes (for both users and the OS itself) and clean up after them when they terminate.
- Process Scheduling: With many processes wanting to run but only one or a few CPUs, the OS must decide which process runs on which CPU and for how long. This is the job of the CPU scheduler.
- Process Suspension and Resumption: The OS must be able to pause (suspend) a process's execution and later continue (resume) it from the same point. This happens constantly due to timer interrupts and I/O requests.
- Process Synchronization: When processes need to communicate or share resources, the OS must provide mechanisms to ensure they do so in an orderly and correct way, preventing chaos (like two processes trying to print to the same printer at the same time).
- Process Communication: The OS provides methods for processes to exchange information with each other.
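Several of these duties can be observed from user space. The sketch below uses Python's subprocess module to create a process, wait for it to terminate, and read its exit status; the one-line child script is made up for the example:

```python
import subprocess
import sys

# Process creation: the OS builds a new process (allocating memory, CPU
# time, and other resources) to run a program -- here, a tiny Python script.
child = subprocess.Popen(
    [sys.executable, "-c", "print('child process running')"],
    stdout=subprocess.PIPE, text=True,
)

# The parent suspends until the child terminates; on termination the OS
# reclaims the child's resources and reports its exit status.
output, _ = child.communicate()
print(output.strip())    # child process running
print(child.returncode)  # 0 on normal termination
```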
These concepts form the core of modern operating systems and are explored in depth in Chapters 3 through 7.
1.5.2 Memory Management¶
The Central Role of Main Memory¶
As you learned in computer architecture, the main memory (RAM) is the central storage unit that the CPU interacts with directly. Think of it as the CPU's immediate workspace.
- It's a large array of bytes, where each byte has a unique address.
- It is a volatile, fast-access repository for data shared by the CPU and I/O devices.
- Following the von Neumann architecture, the CPU must fetch both instructions and data from main memory to execute a program. During the instruction-fetch cycle, it reads the next instruction from memory. During the data-fetch cycle, it reads or writes the data the instruction needs.
A crucial point: the CPU can only directly access data that is in main memory. If data is on a disk, it must first be transferred to RAM before the CPU can process it. Similarly, for a program to run, its instructions must be loaded into memory.
The Basic Memory Management Lifecycle¶
For a single program to run, the process is straightforward:
- Mapping and Loading: The program's instructions and data must be mapped from their relative locations in the program file to specific, absolute addresses in physical memory and then loaded into those locations.
- Execution: The CPU fetches and executes instructions, accessing memory using these absolute addresses.
- Termination: When the program ends, the memory space it occupied is marked as available for the next program.
The Need for Advanced Memory Management¶
The simple "one program at a time" model is inefficient. To improve CPU utilization and allow for multitasking (running several programs concurrently), the operating system must keep multiple programs in memory at the same time.
This creates several challenges that the OS must solve through memory management:
- How do we keep track of which parts of memory are free and which are in use?
- How do we allocate memory to a new process without interfering with existing processes?
- When memory is full, how do we decide which program (or part of a program) to remove to make space for a new one?
- How do we protect one process's memory from being accessed or overwritten by another process?
There are many different memory-management schemes (e.g., paging, segmentation), and their effectiveness depends on the situation and the hardware support available. The choice of algorithm is a major design decision for an OS.
The Operating System's Responsibilities in Memory Management¶
The OS is responsible for all activities related to managing the computer's memory. Its key duties are:
- Tracking Memory Usage: The OS must maintain a record of every memory location, knowing whether it is allocated or free, and if allocated, which process is using it. This is often done with data structures like bitmaps or linked lists.
- Allocating and Deallocating Memory: When a process requests memory (e.g., when it starts or needs to store more data), the OS must find a suitable block of free memory, allocate it to the process, and update its records. When a process releases memory (e.g., when it terminates), the OS must mark that memory as free again.
- Deciding What to Swap: When the system needs more memory than is physically available, the OS must decide which processes (or parts of processes) should be temporarily moved out of memory to a secondary storage area (like a swap space on disk) to free up space. Later, it must decide when to move them back into memory. This process is crucial for running large programs or many programs simultaneously.
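Tracking memory with a bitmap, as mentioned in the first duty above, can be sketched as a toy allocator over fixed-size frames. All names here are invented for illustration, and real allocators are far more sophisticated:

```python
class BitmapMemory:
    """Toy memory manager: one bit per fixed-size frame (0 = free, 1 = used)."""

    def __init__(self, num_frames):
        self.bitmap = [0] * num_frames

    def allocate(self, n):
        """Find n contiguous free frames, mark them used, return the start index."""
        run = 0
        for i, bit in enumerate(self.bitmap):
            run = run + 1 if bit == 0 else 0
            if run == n:
                start = i - n + 1
                for j in range(start, start + n):
                    self.bitmap[j] = 1
                return start
        return None  # not enough contiguous free memory

    def free(self, start, n):
        """Mark n frames beginning at start as free again."""
        for j in range(start, start + n):
            self.bitmap[j] = 0


mem = BitmapMemory(8)
a = mem.allocate(3)   # process A gets frames 0-2
b = mem.allocate(2)   # process B gets frames 3-4
mem.free(a, 3)        # process A terminates; the OS reclaims its frames
c = mem.allocate(3)   # a new process reuses frames 0-2
print(a, b, c)        # 0 3 0
```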
These techniques, which ensure efficient, fair, and safe use of memory, are discussed in detail in Chapters 9 and 10.
1.5.3 File-System Management¶
The Goal: A Logical View of Storage¶
Dealing directly with the physical properties of storage devices (like magnetic platters on a hard drive or memory cells in an SSD) would be incredibly complex for users and programmers. The operating system solves this by providing a uniform, logical view of storage. It hides the messy hardware details behind a simple, abstract concept: the file.
A file is a logical storage unit—a collection of related information created by a user or program. The operating system is responsible for mapping these abstract files onto the actual physical media.
What is a File?¶
A file is an extremely general and powerful concept. It can contain anything:
- Programs: Source code (e.g., my_program.c) or executable object code (e.g., my_program.exe).
- Data: Numeric data, text documents, spreadsheets, images, music files (like MP3s), videos, etc.
Files can be:
- Free-form: Like a simple text file (.txt) where the content has little inherent structure.
- Rigidly formatted: Like a database file or an MP3 file, which must follow a specific internal structure for the data to be meaningful.
The Role of the Operating System¶
The OS provides a consistent way to work with files, regardless of the underlying physical media. These media can vary greatly in their characteristics:
- Type: Magnetic hard disk, solid-state drive (SSD), CD/DVD, USB flash drive, network storage.
- Properties: Access speed, capacity, data-transfer rate, and access method (sequential, like a tape; or random-access, like a disk).
The OS abstracts these differences away. The same system call (open, read, write) works for a file on an SSD, a hard drive, or a USB stick.
To help users and programs manage thousands of files, the OS organizes them into directories (or folders). Directories create a hierarchical structure, making it easier to locate and group related files.
Finally, when multiple users share a system, the OS must provide protection. It controls which users are allowed to access a file and what they are allowed to do with it (e.g., read, write, or execute).
The Operating System's Responsibilities in File Management¶
The OS is responsible for all activities related to managing files and storage. Its key duties are:
- Creating and Deleting Files: The OS provides the mechanism to create new files and delete them when they are no longer needed, managing the space on the storage device.
- Creating and Deleting Directories: The OS allows for the creation of directory structures to organize files and the removal of these directories when they are empty.
- Providing File and Directory Manipulation Primitives: The OS supplies fundamental operations (system calls) for working with files and directories. This includes:
- Opening and closing a file.
- Reading data from a file or writing data to a file.
- Repositioning within a file (seeking).
- Renaming files, moving files between directories, and listing directory contents.
- Mapping Files to Storage: This is a core function. The OS must keep track of which specific blocks of data on a physical storage device belong to which file. It manages the translation from a file's logical structure (a sequence of bytes) to the physical blocks on the disk.
- Backing Up Files: The OS often provides utilities to back up files to stable, non-volatile storage (like another disk or tape) to prevent data loss in case of hardware failure, accidental deletion, or other disasters. This ensures data persistence and integrity.
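The manipulation primitives listed above map closely onto system calls. The sketch below exercises them through Python's os module (the file name and temporary directory are made up for the example):

```python
import os
import tempfile

# Create, write, reposition, read, close -- each call below ultimately
# becomes a system call (open, write, lseek, read, close).
path = os.path.join(tempfile.mkdtemp(), "example.txt")

fd = os.open(path, os.O_CREAT | os.O_RDWR)  # create and open the file
os.write(fd, b"hello, file system")         # write data to it
os.lseek(fd, 7, os.SEEK_SET)                # reposition (seek) to byte 7
data = os.read(fd, 4)                       # read 4 bytes from there
os.close(fd)                                # close the file

print(data)      # b'file'
os.unlink(path)  # delete the file; the OS reclaims its storage blocks
```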
1.5.4 Mass-Storage Management¶
The Need for Secondary and Tertiary Storage¶
As we know from memory management, main memory (RAM) is volatile and too small to hold all the data and programs a system needs permanently. Therefore, we rely on non-volatile mass-storage devices to serve as the permanent, high-capacity repository for the system.
- Secondary Storage (e.g., HDDs, SSDs): This is the primary online storage used for active data and programs. Programs like your web browser or word processor are stored here until they are loaded into memory for execution. These devices are the source of data for processing and the destination for saving results. Because the CPU interacts with these devices so frequently, their efficient management is crucial for overall system performance.
- Tertiary Storage (e.g., Magnetic Tapes, Optical Discs): This is used for offline storage that is slower, lower cost, and often higher capacity. Its uses include:
- Creating backups of important data from secondary storage.
- Storing seldom-used data (archival storage).
- Long-term storage where immediate access is not required.
Tertiary storage is not directly involved in the day-to-day speed of the system, but it is still important for data integrity and archival purposes.
The Operating System's Responsibilities in Secondary Storage Management¶
The proper management of the secondary storage subsystem is a major task for the OS. Its responsibilities include:
- Mounting and Unmounting: This is the process of preparing a storage device (like a USB drive or a new hard disk) for use by the system (mounting) and properly disconnecting it when it's no longer needed (unmounting). This ensures data is written correctly and the filesystem is kept intact.
- Free-Space Management: The OS must keep track of all the free blocks on the storage device so it can quickly allocate space when a new file is created or an existing file grows.
- Storage Allocation: When a file is saved, the OS must decide which specific free blocks on the disk to assign to it. Different allocation methods (contiguous, linked, indexed) have different performance trade-offs.
- Disk Scheduling: This is a critical performance activity. When multiple requests are made to read from or write to the disk, the OS must decide the order in which to service these requests. The goal of the disk scheduling algorithm is to minimize the seek time of the disk arm, thereby maximizing the total throughput of the storage subsystem. The speed of the disk scheduler can directly impact how fast the entire computer feels.
- Partitioning: The OS allows a physical disk drive to be divided into logical sections called partitions. Each partition can be managed as a separate storage device, often with its own filesystem. This is useful for organizing data or installing multiple operating systems.
- Protection: The OS must enforce access-control mechanisms to ensure that unauthorized users cannot access files on secondary storage.
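The impact of disk scheduling order can be sketched by comparing total head movement under first-come, first-served (FCFS) with a sorted, elevator-style pass (a simplified SCAN). The request queue and head position below are assumed values for illustration:

```python
def total_movement(start, order):
    """Total head movement (in cylinders) to service requests in the given order."""
    moves, pos = 0, start
    for cyl in order:
        moves += abs(cyl - pos)
        pos = cyl
    return moves


requests = [98, 183, 37, 122, 14, 124, 65, 67]  # pending cylinder requests
head = 53                                       # current head position

fcfs = total_movement(head, requests)
# Elevator-style pass: service everything at or above the head going up,
# then everything below going down.
up = sorted(c for c in requests if c >= head)
down = sorted((c for c in requests if c < head), reverse=True)
scan = total_movement(head, up + down)

print(fcfs)  # 640 cylinders of movement in arrival order
print(scan)  # 299 cylinders for the same requests, just reordered
```

Same requests, less than half the head movement: this is why the scheduler's choice of service order directly affects how fast the whole machine feels.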
Tertiary Storage Management¶
While sometimes managed by dedicated applications, the operating system can also handle tertiary storage. Its tasks in this area include:
- Managing the insertion and removal (mounting/unmounting) of tapes or discs.
- Controlling which process gets exclusive access to a tertiary storage device.
- Automating the migration of data from secondary storage (e.g., a hard drive) to tertiary storage (e.g., a tape archive) based on policies.
1.5.5 Cache Management¶
The Caching Principle¶
Caching is a fundamental concept in computer systems designed to overcome speed mismatches between different components. The core idea is simple:
- Data is kept in a primary, larger, but slower storage system (e.g., main memory).
- When that data is used, a copy is placed into a smaller, but much faster, storage system called the cache.
- The next time the data is needed, the system first checks the cache.
- If the data is found there (a cache hit), it is used directly from the fast cache, saving significant time.
- If the data is not in the cache (a cache miss), it must be retrieved from the slower primary storage, and a copy is placed into the cache for future use.
This principle is based on the locality of reference, which means that programs tend to access the same data or instructions repeatedly over short periods of time.
The Storage Hierarchy¶
Caching creates a storage hierarchy, a pyramid of storage types where each level is smaller, faster, and more expensive per byte than the level below it. Figure 1.14 provides a detailed comparison of these levels.
Key Takeaway from Figure 1.14: As you move down the hierarchy, the access time increases dramatically (from 0.25 nanoseconds for registers to 5,000,000 ns for a disk), while the capacity increases.
Operating System's Role in Caching¶
The OS is primarily concerned with managing the software-controlled caches in the hierarchy, specifically:
- Main Memory as a Cache for Disk: The OS decides which parts of a file or program to keep in RAM, anticipating that they will be needed soon.
- Disk Caches: The OS may use a portion of main memory to cache frequently accessed disk blocks, speeding up disk I/O operations.
The main challenge in cache management is that cache size is limited. When the cache is full and new data needs to be brought in, the OS must decide which old data to replace. The choice of this replacement policy (e.g., Least Recently Used, or LRU) is critical for performance.
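An LRU policy, together with the hit/miss flow from 1.5.5, can be sketched with Python's OrderedDict. This is a toy cache in front of a pretend "backing store"; all names are invented for illustration:

```python
from collections import OrderedDict

class LRUCache:
    """Toy cache: on a miss, fetch from the backing store; when full,
    evict the least recently used entry to make room."""

    def __init__(self, capacity, backing_store):
        self.capacity = capacity
        self.backing = backing_store
        self.cache = OrderedDict()
        self.hits = self.misses = 0

    def get(self, key):
        if key in self.cache:               # cache hit: fast path
            self.hits += 1
            self.cache.move_to_end(key)     # mark as most recently used
            return self.cache[key]
        self.misses += 1                    # cache miss: slow path
        value = self.backing[key]
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used entry
        self.cache[key] = value
        return value


disk = {n: f"block-{n}" for n in range(100)}  # pretend slow storage
cache = LRUCache(capacity=2, backing_store=disk)
for block in (1, 2, 1, 3, 1):                 # locality: block 1 is reused
    cache.get(block)
print(cache.hits, cache.misses)               # 2 3
```

Because block 1 is referenced repeatedly, LRU keeps it cached while block 2, idle since its first use, is the one evicted.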
The Problem of Coherency¶
Caching introduces a major complication: multiple copies of the same data can exist simultaneously at different levels of the hierarchy.
Example (Follow Figure 1.15):
Imagine an integer A stored in a file on a magnetic disk.
- A program needs to increment A.
- The OS loads the disk block containing A into main memory.
- The CPU then copies A into the hardware cache.
- Finally, A is loaded into a register where the increment operation (A+1) happens.
Now, there are four copies of A: on the disk, in RAM, in the cache, and in the register. After the increment, only the value in the register is correct. The others are now out-of-date (stale).
In a single-process system, this is manageable because all accesses will go to the highest-level (most recent) copy. However, it creates serious problems in more complex environments:
- Multitasking: If the OS switches the CPU to another process after A is incremented in the register but before it's written back to memory, the second process might read the old, stale value of A from main memory.
- Multiprocessor Systems: If multiple CPUs have their own caches, A could be copied into several caches. If one CPU updates A in its local cache, the other CPUs will have stale copies. This is the cache coherency problem, which is typically solved by hardware protocols that invalidate or update all other copies when one is changed.
- Distributed Systems: The problem is magnified when copies (replicas) of a file are stored on different computers across a network. Keeping all these geographically separated replicas consistent when updates occur is a complex challenge.
The OS must implement mechanisms, especially in memory and file management, to ensure that processes always see the most recent version of data, despite this complex caching hierarchy.
1.5.6 I/O System Management¶
The Goal: Hiding Hardware Peculiarities¶
I/O devices are incredibly diverse—keyboards, mice, disk drives, network cards, and graphics cards all function in very different ways. A key purpose of the operating system is to hide these specific hardware details, or "peculiarities," from users and applications. It provides a simple, uniform interface to perform I/O operations, regardless of the underlying device.
The I/O Subsystem¶
To achieve this, the OS contains a dedicated I/O subsystem. Think of it as a specialized department that handles all communication with the outside world. Its main components are:
A Memory-Management Component: This part deals with transferring data between devices and main memory efficiently. It uses several techniques:
- Buffering: Storing data temporarily in an area of memory (a buffer) to smooth out the speed differences between a fast CPU and a slow device. For example, data being sent to a printer is first stored in a buffer so the CPU doesn't have to wait for the slow printing process.
- Caching: Keeping a copy of frequently accessed data in a faster memory (as discussed in 1.5.5) to speed up I/O operations.
- Spooling: This is a high-level form of buffering used for devices like printers that cannot be multiplexed (you can't print lines from two different documents simultaneously). The spooler intercepts output for a device, stores each task as a separate file on disk, and then feeds them to the device one at a time. This allows multiple processes to "finish" their print jobs quickly, even though the physical printing happens sequentially.
A General Device-Driver Interface: This is a standard set of commands (an API) that the rest of the OS uses to talk to any device. It provides a common language for functions like read, write, and open.
Device Drivers: For each specific type of hardware device, there is a device driver. The driver is a software module that understands the exact details and command set of its assigned device. It translates the generic requests from the general device-driver interface into the specific, low-level instructions that the hardware controller expects. Only the device driver needs to know the peculiarities of the device.
As discussed earlier in the chapter, this subsystem relies heavily on interrupts and DMA to handle I/O efficiently, freeing the CPU from being tied up.
1.6 Security and Protection¶
The Difference Between Protection and Security¶
In a multi-user, multitasking system, it is essential to control which processes can access which resources. The OS provides mechanisms for this, which fall into two related categories:
Protection: This is the internal mechanism for controlling the access of processes or users to the resources defined by the computer system. It is about ensuring that each component of a system uses only the resources it is authorized to use.
- Examples: Memory protection hardware ensures a process can only access its own memory space. The timer protects the CPU from being monopolized by a single process. Privileged instructions protect device controllers.
Security: This is the external and internal defense of the system from malicious attacks. Protection mechanisms are the tools used to build security. A system can have perfect protection mechanisms but still be insecure if, for example, a user's password is stolen.
- Examples of attacks: Viruses, worms, denial-of-service attacks, identity theft, theft of service.
In short, protection is about ensuring controlled access; security is about defending against attackers.
User and Group Identities¶
For protection and security to work, the system must be able to identify who is who. The OS maintains a database of users:
- User ID (UID): A unique numerical identifier assigned to each user. In Windows, this is called a Security ID (SID). When a user logs in, all processes created by that user are "tagged" with this ID.
- Group ID (GID): Users can be organized into groups. This allows the OS to manage permissions for collections of users efficiently (e.g., granting read access to a file to everyone in the "students" group). A user can belong to one or more groups.
Privilege Escalation¶
Sometimes a user needs to temporarily perform an action that requires higher privileges than they normally have (e.g., changing their password, which requires writing to the system password file).
Operating systems provide mechanisms for privilege escalation. A common example in UNIX-like systems is the setuid (set user ID) attribute. When a program with the setuid bit enabled is executed, it does not run with the user's ID, but with the ID of the program's owner (often the root/administrator). This allows a regular user to execute specific, privileged operations in a controlled way. The process runs with this effective UID until it relinquishes the privilege or ends.
These concepts of protection and security are explored in depth in Chapters 16 and 17.
1.7 Virtualization¶
What is Virtualization?¶
Virtualization is a technology that takes the physical hardware of a single computer—the CPU, memory, disks, and network cards—and abstracts it to create multiple, isolated, virtual execution environments. Each of these environments, called a virtual machine (VM), behaves like a separate private computer, complete with its own operating system.
A user can run multiple, different operating systems (like Windows, Linux, and macOS) simultaneously on the same physical machine and switch between them just like switching between application windows.
Virtualization vs. Emulation¶
It's important to distinguish virtualization from a related concept: emulation.
Emulation: This involves simulating the hardware of one type of CPU on a different type of CPU. The emulator translates every instruction from the "guest" system into instructions the "host" system's CPU can understand.
- Example: When Apple switched from PowerPC to Intel processors, they provided "Rosetta," which emulated a PowerPC CPU on an Intel CPU, allowing old applications to run.
- Drawback: Emulation is computationally expensive because of the translation process, leading to significant performance loss.
Virtualization: This requires the guest operating system to be compiled for the same CPU architecture as the host machine. The virtualization software does not need to simulate a different CPU; it simply manages access to the real, physical CPU.
- Benefit: Because the guest OS is running on native hardware, performance is much higher than with emulation.
How Virtualization Works: The Virtual Machine Manager¶
The core software that enables virtualization is called the Virtual Machine Manager (VMM), also known as a hypervisor.
Refer to Figure 1.16 for a visual comparison:
- (a) Traditional System: A single operating system kernel manages hardware and runs multiple processes.
- (b) Virtualized System: A Virtual Machine Manager runs directly on the hardware. The VMM then creates and manages multiple virtual machines (VM1, VM2, VM3). Each VM runs its own, full operating system kernel and its own set of processes.
The VMM is responsible for:
- Resource Allocation: It allocates shares of the physical CPU, memory, and I/O devices to each virtual machine.
- Isolation and Protection: It ensures that each virtual machine is isolated from the others. A crash or problem in one VM does not affect the others.
Why is Virtualization Useful?¶
Even though modern OSs are great at multitasking, virtualization is extremely important for several reasons:
- Consolidation: In data centers, instead of running one application per physical server (which wastes resources), many virtual machines can run on a single, powerful physical server. This reduces hardware costs, power consumption, and physical space needs.
- Development and Testing: Software developers can test their applications on different operating systems (Windows, Linux, etc.) all on a single laptop or server.
- Legacy Application Support: A business can run an old application that only works on Windows XP inside a Windows XP virtual machine on a modern computer.
- Cloud Computing: The entire cloud infrastructure is built on virtualization. When you rent a server from a cloud provider, you are almost always getting a virtual machine.
Types of Virtualization¶
The text mentions an evolution:
- Hosted VMMs (Type 2): The VMM runs as an application on top of a host operating system (e.g., VMware Workstation, Oracle VirtualBox). This is common for desktop use.
- Bare-Metal VMMs (Type 1): The VMM is installed directly on the physical hardware and acts as the host operating system itself (e.g., VMware ESXi, Citrix XenServer). This is common in data centers for maximum performance and efficiency.
Virtualization is a deep topic, and its full implementation details are covered in Chapter 18.
1.8 Distributed Systems¶
What is a Distributed System?¶
A distributed system is a group of independent, physically separate computers that are connected by a network. These computers work together to appear as a single, coherent system to the user. The key idea is that multiple machines, which might be different from each other (heterogeneous), collaborate to provide services.
The goals of creating a distributed system are to improve:
- Computation Speed: Tasks can be split up and run on multiple computers in parallel.
- Functionality: Users can access services and resources that aren't available on their local machine.
- Data Availability: Data can be replicated across multiple machines, so it's still accessible even if one machine fails.
- Reliability: If one computer fails, the system as a whole can continue to operate using the remaining computers.
The Role of Networking¶
A network is the fundamental communication path that makes distributed systems possible. It is simply a connection between two or more computers. The operating system handles network access in different ways:
- Some OSs make networking look like file access (e.g., you can access a remote file as if it were on your local disk using a network file system like NFS).
- Other times, users or applications must explicitly use network functions (e.g., using an FTP client to transfer a file).
Most systems support a mix of these approaches. The most common network protocol suite is TCP/IP, which forms the backbone of the internet. From the OS's perspective, a network is managed by a network interface card (NIC) and its corresponding device driver.
Types of Networks¶
Networks are categorized based on the geographical distance they cover:
| Type | Name | Typical Range | Example |
|---|---|---|---|
| PAN | Personal-Area Network | Several feet | Connecting a wireless headset to a phone via Bluetooth |
| LAN | Local-Area Network | Room, Building, Campus | A network connecting all computers in a university department using Ethernet or Wi-Fi |
| MAN | Metropolitan-Area Network | A city | A network connecting libraries across a city |
| WAN | Wide-Area Network | Country, Global | The internet, or a private network connecting a company's offices worldwide |
Networks can use various media to transmit data, including copper wires, fiber optic cables, and wireless transmissions (radio waves, microwaves, satellites).
Network Operating Systems vs. Distributed Operating Systems¶
There's a spectrum of how closely integrated the computers in a network are:
Network Operating System:
- In this model, each computer is autonomous and runs its own independent operating system.
- The OS is "network-aware"—it provides features that allow it to share files and exchange messages with other computers on the network.
- However, users are typically aware that they are accessing remote resources. For example, they might have to log into a remote machine explicitly to use its files.
- This is a loosely coupled system.
Distributed Operating System:
- This is a more advanced, tightly coupled model.
- The different computers communicate so closely that they create the illusion of a single, unified operating system controlling the entire network.
- A user doesn't need to know where a file is stored or which CPU is executing a program; the system handles all of this transparently.
- This is much more complex to implement but provides a simpler experience for the user.
The concepts of networking and distributed systems are explored in detail in Chapter 19.
1.9 Kernel Data Structures¶
The efficiency of an operating system depends not just on its algorithms but also on the data structures it uses to organize information. This section covers fundamental data structures that are ubiquitous in kernel code.
1.9.1 Lists, Stacks, and Queues¶
The Limitation of Arrays¶
An array is a simple data structure where any element can be accessed directly by its index, much like how main memory is addressed. However, arrays have limitations for OS tasks:
- They are inefficient for storing data items of varying sizes.
- Inserting or deleting an item in the middle of an array requires shifting all subsequent elements, which is computationally expensive.
Linked Lists¶
To overcome these limitations, operating systems heavily use lists. In a list, items are accessed sequentially. The most common implementation is a linked list, where each item (or node) contains data and a pointer to the next node.
There are several types of linked lists, as shown in the figures:
- Singly Linked List (Figure 1.17): Each node points to the next node in the list. The last node points to NULL, indicating the end of the list.
- Doubly Linked List (Figure 1.18): Each node has pointers to both its predecessor (previous) and successor (next) node. This allows for traversal in both directions but requires more memory per node.
- Circularly Linked List (Figure 1.19): The last node points back to the first node, creating a circle. This is useful for applications that cycle through data continuously.
Advantages of Linked Lists:
- They can easily handle items of different sizes.
- Insertion and deletion are very efficient (O(1)) once the position is found, as they only require updating a few pointers.
Disadvantage of Linked Lists:
- Finding a specific item requires traversing the list from the beginning, which takes linear time, O(n).
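As an illustrative sketch (in Python for brevity; real kernels implement their lists in C, as in Linux's <linux/list.h>), here is a minimal singly linked list showing the trade-off just described: O(1) insertion at the head, O(n) search:

```python
class Node:
    """A singly-linked-list node: a data item plus a pointer to the next node."""
    def __init__(self, data):
        self.data = data
        self.next = None  # None plays the role of NULL

class LinkedList:
    def __init__(self):
        self.head = None

    def push_front(self, data):
        # O(1): only the head pointer and one node pointer change.
        node = Node(data)
        node.next = self.head
        self.head = node

    def find(self, data):
        # O(n): must walk the chain from the head until a match or NULL.
        cur = self.head
        while cur is not None:
            if cur.data == data:
                return cur
            cur = cur.next
        return None

lst = LinkedList()
for x in [3, 2, 1]:
    lst.push_front(x)
print(lst.find(2) is not None)  # True
```

Note that no elements were shifted during insertion, which is exactly why linked lists beat arrays for this workload.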
Stacks¶
A stack is a data structure that follows the Last-In, First-Out (LIFO) principle. Think of a stack of plates: you add a plate to the top, and you take a plate from the top.
The two fundamental operations are:
- Push: Add an item to the top of the stack.
- Pop: Remove the top item from the stack.
Operating System Use Case: Stacks are crucial for managing function calls. When a function is called, the OS pushes the return address (where to go back to), parameters, and local variables onto a region of memory called the call stack. When the function returns, these items are popped off the stack, and the CPU resumes from the return address.
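The call-stack behavior described above can be mimicked with a few lines of Python (a toy model, not how a real OS lays out stack frames; a Python list's append/pop give the push/pop operations):

```python
# LIFO stack: append() pushes onto the top, pop() removes the top item.
call_stack = []

def push(frame):
    call_stack.append(frame)

def pop():
    return call_stack.pop()

# Simulate main() calling f(), which in turn calls g():
push({"func": "main", "return_to": None})
push({"func": "f", "return_to": "main"})
push({"func": "g", "return_to": "f"})

# Frames come off in reverse order of the calls: last in, first out.
assert pop()["func"] == "g"
assert pop()["func"] == "f"
assert pop()["func"] == "main"
```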
Queues¶
A queue is a data structure that follows the First-In, First-Out (FIFO) principle. Think of a line of people waiting for a service: the first person to join the line is the first one to be served.
Operating System Use Cases: Queues are everywhere in operating systems:
- Printer Queue: Print jobs are sent to a queue and printed in the order they were received.
- CPU Scheduling: As you will see in Chapter 5, processes that are ready to run are placed in a ready queue. The scheduler selects the next process from the front of this queue to run on the CPU.
- I/O Device Waiting: Processes waiting for an I/O device (like a disk) are placed in a device queue.
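A ready queue like the one mentioned above can be sketched with Python's deque (the process names are made up for illustration; Chapter 5 covers real scheduling):

```python
from collections import deque

# FIFO ready queue: processes are served in arrival order.
ready_queue = deque()
for pid in ["P1", "P2", "P3"]:
    ready_queue.append(pid)        # enqueue at the back

running = ready_queue.popleft()    # scheduler dequeues from the front
assert running == "P1"             # first in, first out
```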
These simple data structures form the building blocks for more complex kernel components, allowing the OS to manage resources and control flow efficiently.
1.9.2 Trees¶
What is a Tree?¶
A tree is a hierarchical data structure. Data is organized in nodes connected by parent-child relationships, much like a family tree or a company's organizational chart.
- General Tree: A parent node can have any number of child nodes.
- Binary Tree: A more restricted and common form where a parent node can have at most two children, typically called the left child and the right child.
Binary Search Trees (BST)¶
A Binary Search Tree (BST) is a binary tree with an important ordering property: for any given node, the values in its left subtree are all less than or equal to the node's value, and the values in its right subtree are all greater than the node's value.
Refer to Figure 1.20 for an example. The root node holds the value 17. All values in its left subtree (6, 12, 14) are less than 17. All values in its right subtree (35, 38, 40) are greater than 17. This property holds for every node in the tree.
This structure allows for efficient searching. To find a value, you start at the root and compare the target value to the current node. If it's smaller, you go to the left child; if it's larger, you go to the right child. You repeat this process until you find the value or reach a NULL pointer.
- Worst-Case Performance: If items are inserted in sorted order (e.g., 1, 2, 3, 4), the tree effectively becomes a linked list. Searching for an item in this degenerate case has a worst-case performance of O(n).
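A small sketch makes the contrast concrete. Below, one BST is built from the Figure 1.20 values (bushy) and one from sorted insertions (degenerate); counting comparisons shows why the second case behaves like a linked list:

```python
class BSTNode:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

def insert(root, value):
    # Values <= the node go left, larger values go right.
    if root is None:
        return BSTNode(value)
    if value <= root.value:
        root.left = insert(root.left, value)
    else:
        root.right = insert(root.right, value)
    return root

def search(root, target):
    # Returns how many nodes were examined before finding the target.
    steps = 0
    while root is not None:
        steps += 1
        if target == root.value:
            return steps
        root = root.left if target < root.value else root.right
    return steps  # not found

bushy = None
for v in [17, 12, 35, 6, 14, 38, 40]:   # values from Figure 1.20
    bushy = insert(bushy, v)

chain = None
for v in [1, 2, 3, 4, 5, 6, 7]:          # sorted order: degenerate chain
    chain = insert(chain, v)

print(search(bushy, 40), search(chain, 7))  # 4 comparisons vs. 7
```

In the degenerate tree every search for the last item touches all n nodes, which is the O(n) worst case the text warns about.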
Balanced Binary Search Trees¶
To avoid the worst-case scenario, we use balanced binary search trees. These trees use algorithms to ensure that the tree remains "bushy" and does not become a long chain. In a balanced tree with n items, the maximum number of levels from the root to a leaf is proportional to log₂ n (written as O(log n)).
This guarantees that the worst-case search, insertion, and deletion times are O(log n), which is very efficient even for large values of n.
Operating System Use Case: The text mentions that the Linux kernel uses a specific type of balanced BST called a red-black tree in its CPU-scheduling algorithm (specifically, the Completely Fair Scheduler). This allows it to efficiently manage and select the next process to run.
1.9.3 Hash Functions and Maps¶
The Goal: Constant-Time Access¶
Searching through a list or even a balanced tree requires multiple steps (O(n) or O(log n)). A hash function is a technique that aims to achieve constant-time O(1) data retrieval, meaning the time to find an item is (ideally) the same regardless of how many items are stored.
How Hashing Works¶
A hash function takes a piece of data (like a string or a number) as input, performs a calculation, and outputs a numeric value called a hash value or hash code.
This hash value is then used as an index into a table (usually an array) to directly locate the data. Instead of searching through all items, you compute the hash and go straight to the corresponding array slot.
Hash Collisions¶
A potential problem is that two different inputs might produce the same hash value. This is called a hash collision.
- Solution: The common solution is to have each slot in the hash table hold a linked list. All items that hash to the same index are stored in a list at that location. When retrieving an item, the hash function points you to the correct list, and then you perform a (hopefully short) linear search within that list.
- Efficiency: The quality of a hash function is measured by how well it distributes items evenly across the table, minimizing collisions. A good hash function with few collisions provides performance close to O(1). A bad one can degrade to O(n).
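The chaining scheme just described can be sketched as follows (the hash function here is a deliberately crude stand-in; production tables use much better-distributing functions):

```python
TABLE_SIZE = 8

def hash_index(key):
    # Toy hash function: sum of character codes modulo the table size.
    return sum(ord(c) for c in key) % TABLE_SIZE

# Each slot holds a chain (list) of (key, value) pairs.
table = [[] for _ in range(TABLE_SIZE)]

def put(key, value):
    chain = table[hash_index(key)]
    for i, (k, _) in enumerate(chain):
        if k == key:                 # key already present: update in place
            chain[i] = (key, value)
            return
    chain.append((key, value))       # colliding keys share this chain

def get(key):
    # Hash straight to one slot, then a short linear search of its chain.
    for k, v in table[hash_index(key)]:
        if k == key:
            return v
    return None

put("alice", "pw1")
put("bob", "pw2")
assert get("alice") == "pw1"
assert get("carol") is None
```

With a good hash function the chains stay short, so lookups are close to O(1); if every key landed in one slot, the single long chain would degrade lookups to O(n).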
Hash Maps¶
A hash map (or hash table) is a data structure that uses hashing to store and retrieve [key: value] pairs.
Operating System Use Case: The text provides a classic example: user authentication.
- The system stores a table of [username: password] pairs.
- When a user enters their username and password, the system applies the hash function to the username.
- The resulting hash value is used as an index to instantly retrieve the stored password associated with that username from the table.
- The system then compares the retrieved password with the one the user entered.
This is much faster than searching through a list of all users every time someone logs in. Hash maps are used throughout operating systems for tasks that require very fast lookups, such as file system directory lookups and managing kernel objects.
1.9.4 Bitmaps¶
What is a Bitmap?¶
A bitmap (or bit array) is a simple but powerful data structure consisting of a sequence of n binary digits (bits). Each bit, which can be either 0 or 1, is used to represent the status of a corresponding item.
For example, a bitmap can be used to track the availability of n resources:
- Bit value 0: Could mean the resource is available.
- Bit value 1: Could mean the resource is unavailable (or vice versa, the convention is arbitrary).
The position of the bit in the string corresponds to the resource ID. The value of the bit at the i-th position tells you the status of the i-th resource.
Example from the text:
Consider the bitmap: 001011101
- Bit 0: 0 -> Resource 0 is available
- Bit 1: 0 -> Resource 1 is available
- Bit 2: 1 -> Resource 2 is unavailable
- Bit 3: 0 -> Resource 3 is available
- Bit 4: 1 -> Resource 4 is unavailable
- Bit 5: 1 -> Resource 5 is unavailable
- Bit 6: 1 -> Resource 6 is unavailable
- Bit 7: 0 -> Resource 7 is available
- Bit 8: 1 -> Resource 8 is unavailable
The Power of Space Efficiency¶
The primary advantage of a bitmap is its extreme space efficiency: a single bit is the smallest unit of data a computer can represent.
- Comparison: If you were to use a Boolean variable (which typically occupies one byte, or 8 bits, in languages like C) to track each resource's status, your data structure would be 8 times larger than a bitmap.
- Significance: This efficiency is critical when you need to track the status of thousands or millions of items. The memory savings are enormous.
Operating System Use Case: Disk Block Management¶
A classic use of bitmaps in operating systems is for free-space management on a disk.
- A disk is divided into many small units called disk blocks.
- A large disk can have millions of these blocks.
- The file system uses a bitmap where each bit corresponds to one disk block.
- Bit value 0: Block is free and available for allocation.
- Bit value 1: Block is allocated to a file and is in use.
When a file needs to be created or extended, the OS can quickly scan the bitmap to find a free block (a '0' bit). When a file is deleted, the OS simply sets the bits corresponding to its blocks back to '0'. This makes allocation and deallocation very fast.
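The allocate/free cycle described above can be sketched with an integer used as the bitmap (a simplified model; a real file system keeps its bitmap in dedicated on-disk blocks and uses optimized word-at-a-time scans):

```python
NUM_BLOCKS = 16

# One integer as the bitmap: bit i == 0 means block i is free,
# bit i == 1 means block i is allocated.
bitmap = 0

def allocate_block():
    global bitmap
    for i in range(NUM_BLOCKS):      # scan for the first 0 bit
        if not (bitmap >> i) & 1:
            bitmap |= (1 << i)       # set bit i: block now in use
            return i
    return -1                        # disk full: no free block

def free_block(i):
    global bitmap
    bitmap &= ~(1 << i)              # clear bit i: block is free again

a = allocate_block()   # grabs block 0
b = allocate_block()   # grabs block 1
free_block(a)          # delete a file: its block goes back to 0
c = allocate_block()   # the freed block 0 is reused
assert (a, b, c) == (0, 1, 0)
```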
Linux Kernel Data Structures¶
The text provides a helpful note on where to find these data structures in a real-world OS, the Linux kernel. This demonstrates that these are not just theoretical concepts but are used extensively in practice.
- Linked Lists: The implementation is found in the include file <linux/list.h>.
- Queues (kfifo): The implementation for a queue (called a kfifo in Linux) is in the source file kfifo.c.
- Balanced Binary Search Trees: Linux uses red-black trees, and their implementation details are in <linux/rbtree.h>.
Summary¶
In summary, fundamental data structures like lists, stacks, queues, trees, hash maps, and bitmaps are the building blocks of operating system kernels. They are used to manage processes, memory, files, and all other system resources efficiently. Understanding these structures is key to understanding how the OS itself is implemented.
1.10 Computing Environments¶
This section discusses how operating systems are used in different settings, from offices to homes, and how these environments have evolved over time.
1.10.1 Traditional Computing¶
The concept of "traditional computing" has changed significantly. The clear boundaries that once existed between different types of computer systems have become blurred due to advancements in networking and web technologies.
The Evolution of the Office Environment:
- Past: A typical office used to consist of individual personal computers (PCs) connected to a local network. Specialized computers called servers provided central services like file storage and printing. Remote access to the office network was difficult, and portability was limited to laptop computers that had to be physically carried and connected.
- Present: Modern offices are defined by web technologies and high-speed Wide Area Networks (WANs).
- Portals: Companies now create internal websites (portals) that employees can access securely from anywhere, reducing the reliance on direct connections to internal servers.
- Network Computers / Thin Clients: These are simplified computers that act more like terminals. They rely heavily on a central server to do most of the processing. They are used when easier maintenance or stronger security is needed, as the software and data are centrally managed.
- Mobility: Mobile devices like smartphones and tablets can synchronize with desktop computers and connect directly to company networks via Wi-Fi or cellular data to access the web portal.
The Evolution of the Home Environment:
- Past: Homes typically had one computer with a slow dial-up modem connection to the internet or a remote office.
- Present: High-speed internet is now common and affordable. This has transformed home networks.
- Home computers can now act as servers themselves (e.g., serving web pages or media).
- Homes often have complex networks that include multiple devices like printers, client PCs, and servers.
- Firewalls are essential for home network security. A firewall is a system (often part of a router) that controls the incoming and outgoing network traffic based on security rules, protecting the devices on the network from unauthorized access.
A Note on Historical System Types: In the past, when computing resources were scarce and expensive, operating systems were designed to maximize resource utilization. There were two main types:
- Batch Systems: These processed jobs (like running a program on a large dataset) one after another in a batch. Input was predetermined from files, not interactive users.
- Interactive Systems: These waited for direct input from a user.
To make the most of these expensive machines, time-sharing was developed. Time-sharing systems allow multiple users to interact with the same computer simultaneously. The operating system uses a timer and scheduling algorithms to rapidly switch the CPU between each user's processes. This switching happens so fast that it gives each user the illusion that they have their own dedicated machine.
Time-Sharing Today: While traditional multi-user time-sharing systems are rare, the fundamental technique is still used everywhere. On your personal laptop, the CPU is being time-shared between all the processes you have running—your web browser, your music player, system background tasks, and each individual tab in the browser might be its own process. The operating system gives a small slice of CPU time to each process, creating the experience of multitasking. So, the core idea of time-sharing is now applied to the processes of a single user.
1.10.2 Mobile Computing¶
Mobile computing refers to the use of handheld devices like smartphones and tablets. Their key features are portability and light weight.
Evolution of Mobile Devices:
- Past: Initially, mobile devices sacrificed screen size, memory capacity, and overall power compared to desktops and laptops. This trade-off was made to gain mobile access to basic services like email and web browsing.
- Present: The functionality gap has narrowed significantly. Modern mobile devices are so powerful that it's often hard to tell the difference between a high-end tablet and a laptop. In fact, mobile devices now provide unique functionalities that are either impossible or impractical on traditional computers.
Unique Features and Applications: Mobile devices are used for a vast range of tasks beyond communication, including media consumption (music, video, books) and content creation (photos, HD video recording and editing). Their unique hardware has enabled entirely new application categories:
- GPS (Global Positioning System): An embedded chip that uses satellites to determine the device's exact location on Earth. This is crucial for navigation apps (like Google Maps) and for finding nearby services.
- Accelerometer: Detects the device's orientation relative to the ground and senses motion like tilting and shaking, enabling intuitive motion controls in games and features such as automatic screen rotation.
- Gyroscope: Works with the accelerometer to provide more precise orientation data. Together, these sensors enable Augmented-Reality (AR) applications, which overlay digital information onto a live view of the real world through the device's camera. It's difficult to imagine such applications on a traditional, non-mobile computer.
Technical Constraints and Connectivity:
- Networking: Mobile devices connect to online services using IEEE 802.11 (Wi-Fi) wireless networks or cellular data networks (4G/5G).
- Hardware Limitations: Despite their power, mobile devices still have more limited storage and processing speed compared to desktop PCs. For example, a smartphone might have 256 GB of storage, while a desktop could have 8 TB. To conserve battery life, mobile processors are often smaller, slower, and have fewer cores than their desktop counterparts.
Dominant Mobile Operating Systems: Two operating systems dominate this space:
- Apple iOS: Designed to run exclusively on Apple's devices like the iPhone and iPad.
- Google Android: An open-source OS that powers smartphones and tablets from many different manufacturers (like Samsung, Google, etc.).
We will examine these OSes in more detail in Chapter 2.
1.10.3 Client-Server Computing¶
This is a fundamental model for organizing networked systems. In a client-server system, the workload is divided between two types of computers:
- Servers: Powerful systems that provide services or resources.
- Clients: Devices (like desktops, laptops, or smartphones) that use those services.
The general structure of this system is shown in Figure 1.22. Clients send requests over a network, and servers respond to those requests.
Categories of Servers:
Server systems can be broadly classified into two types based on the service they provide:
Compute Servers:
- What they do: They provide an interface for clients to request that a specific action or computation be performed.
- How it works: The client sends a request (e.g., "process this data"). The server executes the action and sends the result back to the client.
- Example: A database server. When you search for a product on a website, your browser (the client) sends a query to the website's database server. The server processes the query, finds the matching products, and sends the results back to your browser.
File Servers:
- What they do: They provide a file-system interface, allowing clients to create, read, update, and delete files.
- How it works: The client requests a specific file, and the server sends the entire file over the network.
- Example: A web server is a classic file server. When you visit a webpage, your browser requests the HTML, CSS, and image files from the web server. The server then sends those files to your browser to be displayed. The files can range from simple text to complex multimedia like high-definition video.
1.10.4 Peer-to-Peer Computing¶
Peer-to-Peer (P2P) computing is a different model for building distributed systems. Unlike the client-server model, there is no permanent distinction between clients and servers. Instead, every computer (or "node") in the network is considered a peer. Each peer can act as both a client (requesting a service) and a server (providing a service), depending on the situation.
Advantage over Client-Server: The main advantage of a P2P system is that it eliminates the bottleneck of a central server. In a client-server system, if the server fails or gets overloaded with requests, the entire service can go down or become slow. In a P2P system, services can be provided by many different nodes distributed across the entire network, making the system more robust and scalable.
How Peer-to-Peer Works: Discovering Services For a P2P system to function, a node must first join the network. Once it's part of the network, a key challenge is figuring out which peer offers the service or resource it needs. There are two primary ways this is accomplished:
Centralized Lookup Service:
- How it works: When a node joins the network, it tells a centralized server what services or resources it can provide. This server acts as a directory or index. When a node needs a service, it first contacts this central lookup server to ask, "Who has what I need?" The lookup server responds with the address of the peer that can provide the service. After that, the two peers communicate directly with each other.
- Analogy: This is like a library's central card catalog. You go to the catalog to find which shelf has the book you want, then you go directly to that shelf to get it.
Decentralized Discovery (No Central Server):
- How it works: This method uses no central directory. Instead, a peer that needs a service broadcasts a request to all the other peers it is connected to. The request essentially asks, "Does anyone have this file or service?" If a peer receives the request and can fulfill it, that peer responds directly to the requester. To make this work, the system requires a discovery protocol—a set of rules that allows peers to find each other and advertise their services.
- This scenario is illustrated in Figure 1.23, which shows a peer-to-peer system with no centralized service.
- Analogy: This is like shouting a question in a crowded room. Anyone who knows the answer can shout it back to you.
Historical and Modern Examples: P2P networks became famous in the late 1990s with file-sharing applications:
- Napster: Used a centralized lookup service. A central server maintained an index of all the music files available on users' computers. When you searched for a song, you queried Napster's central server, which told you which user had the file. The actual file transfer then happened directly between your computer and the other user's computer. Napster was shut down due to copyright infringement lawsuits.
- Gnutella: Used a decentralized discovery approach. When you searched for a file, your request was broadcast to other Gnutella users. Those who had the file would respond to you directly. This lack of a central server made it harder to shut down.
A Hybrid Example: Skype Skype (especially its earlier versions) is a good example of a hybrid peer-to-peer system.
- It uses a centralized login server to authenticate users when they first sign in.
- However, once users are logged in, the system tries to establish direct peer-to-peer connections for voice and video calls, as well as for text messaging (using voice-over-IP, or VoIP, technology).
- This hybrid approach combines the convenience of central management (for login) with the efficiency and scalability of direct peer-to-peer communication.
1.10.5 Cloud Computing¶
Cloud Computing is a model for delivering computing resources—like processing power, storage, databases, and even full software applications—as a service over a network (almost always the Internet). Instead of owning and maintaining their own physical computing infrastructure, users can access these resources on-demand, paying only for what they use.
It can be seen as a large-scale extension of virtualization. Cloud providers have massive data centers filled with thousands of physical servers. They use virtualization to create countless Virtual Machines (VMs) on this hardware, which are then allocated to customers as needed.
Types of Cloud Computing: Cloud computing is categorized in several ways. The categories often overlap, and a single cloud environment can provide a combination of them.
1. By Deployment Model (Who can use it?):
- Public Cloud: The cloud infrastructure is owned and operated by a commercial provider (like Amazon Web Services (AWS), Microsoft Azure, or Google Cloud). It is made available to the general public over the Internet. Anyone can sign up and pay for services.
- Private Cloud: The cloud infrastructure is operated solely for a single organization. It may be managed by the organization itself or by a third party, but it exists within the organization's firewall, offering more control and security.
- Hybrid Cloud: A combination of public and private clouds that remain distinct but are connected by technology, allowing data and applications to be shared between them. This gives a business flexibility—for example, running its normal workload on a private cloud but "bursting" out to a public cloud for peak demand periods.
2. By Service Model (What is provided?):
- Software as a Service (SaaS): Delivers full, ready-to-use applications over the Internet. The user doesn't manage the underlying infrastructure or platform; they just use the software. Examples: Gmail, Microsoft Office 365, Salesforce.
- Platform as a Service (PaaS): Provides a platform or environment (including programming languages, databases, web servers, and development tools) that allows customers to develop, run, and manage their own applications without the complexity of building and maintaining the underlying infrastructure. Examples: Google App Engine, Microsoft Azure App Services.
- Infrastructure as a Service (IaaS): Provides the fundamental computing resources: virtual machines, storage, networks, and operating systems. Users have control over the OS and deployed applications but do not manage the underlying cloud infrastructure. Examples: Amazon EC2 (for compute) and Amazon S3 (for storage).
Cloud Management and Architecture: Inside a cloud data center, you will find traditional operating systems running on the physical servers. However, the key software layers that make cloud computing work are:
- Virtual Machine Monitors (VMMs) / Hypervisors: These manage the virtual machines on each physical server.
- Cloud Management Tools: These operate at a higher level, managing the entire pool of resources across all servers. Tools like VMware vCloud Director or open-source options like OpenStack and Eucalyptus orchestrate the VMMs, allocate resources to users, and provide the customer interface. Because these tools manage the fundamental resources of the entire data center, they can be considered a new type of large-scale, distributed operating system.
Figure 1.24 illustrates the architecture of a public cloud offering IaaS.
- Customer Requests: Users send requests over the Internet.
- Firewall: Both the cloud services and the management interface are protected by a firewall to ensure security.
- Customer Interface & Cloud Management: The request goes through a customer interface, which is managed by the cloud management services. A load balancer distributes incoming requests across multiple servers to avoid overloading any single one.
- Infrastructure: The management system then provisions the required resources from the pools of servers, storage, and virtual machines.
1.10.6 Real-Time Embedded Systems¶
Embedded computers are the most common type of computer in the world. They are specialized devices designed to perform specific, dedicated tasks and are built into larger systems. You find them in car engines, medical devices, industrial robots, microwave ovens, and TV remotes.
Key Characteristics of Embedded Systems:
- Specific Task: They are designed for a single purpose or a limited set of functions.
- Primitive OS: The operating systems they run are often very simple and lightweight, providing only the essential features needed for the task.
- Limited or No User Interface (UI): They typically don't have a complex UI like a desktop computer. Instead, they spend most of their time monitoring sensors and controlling hardware directly.
Variations in Embedded Systems: Not all embedded systems are the same. They exist on a spectrum of complexity:
- General-Purpose Computer with Special Software: Some are essentially standard computers running a general-purpose OS (like a stripped-down version of Linux) with a custom application that performs the specific task (e.g., a kiosk or a smart TV).
- Dedicated Hardware with an Embedded OS: Others are custom hardware devices that run a special-purpose embedded operating system built solely for that device's function.
- Application-Specific Integrated Circuit (ASIC): The simplest forms are hardware chips (ASICs) programmed to perform a specific function without needing any operating system at all.
Expanding Role: The "Smart" World The use of embedded systems is growing rapidly, especially with the rise of the "Internet of Things" (IoT). Entire homes can be automated, with a central computer (which could be an embedded system itself) controlling heating, lighting, and appliances. Web connectivity allows for remote control, like telling your house to turn up the heat before you arrive home. Future possibilities include a refrigerator that can automatically order milk when it senses you are running low.
The Critical Link: Real-Time Operating Systems¶
Most embedded systems require a Real-Time Operating System (RTOS). A real-time system is critical when there are rigid, well-defined time constraints for processing data and responding to events. They are primarily used as control devices.
How it Works: Sensors (like a thermometer or a motion detector) send data to the computer. The computer must analyze this data and, if necessary, send a command to an actuator (like a valve or a motor) within a strict deadline. Examples include:
- Medical imaging systems (e.g., a CT scanner)
- Industrial control systems (e.g., controlling a robotic arm on an assembly line)
- Automotive systems (e.g., engine fuel injection, anti-lock brakes)
- Avionics and weapon systems
The Definition of "Failure" in Real-Time Systems: In a real-time system, correctness depends not only on the right answer but also on the time taken to produce it.
- A result that is correct but delivered too late is a system failure.
- Example: If a robot arm is building a car and receives a "halt" signal because something is wrong, the system fails if the arm does not stop before it crashes into the car. A delay of a few milliseconds could be catastrophic.
This is a fundamental difference from a general-purpose system like your laptop. On your laptop, it is desirable for the system to respond quickly, but a slight delay is usually acceptable. In a hard real-time system, a delay is not acceptable.
We will explore the scheduling algorithms that make real-time operation possible in Chapter 5, and look at the real-time features of the Linux kernel in Chapter 20.
1.11 Free and Open-Source Operating Systems¶
The ability to study operating systems is greatly enhanced by the availability of free and open-source software. Both types provide the source code of the operating system, which is the human-readable instructions written by programmers. This is in contrast to the compiled binary code, which is the machine-readable format that the computer actually executes.
It is important to understand that "free software" and "open-source software" are distinct concepts, championed by different communities.
- Free Software (or Free/Libre Software): The "free" here refers to freedom, not just price. Free software is defined by its licensing, which guarantees users four essential freedoms:
- The freedom to run the program for any purpose.
- The freedom to study how the program works and change it (which requires access to the source code).
- The freedom to redistribute copies.
- The freedom to distribute copies of your modified versions to others.
- Open-Source Software: This term focuses more on the practical benefits of having access to the source code, such as improved collaboration and security. While it requires the source code to be available, its licenses may not grant all the freedoms associated with "free software."
The Key Difference: Therefore, all free software is open source, but not all open-source software is "free" in the libre sense. Some open-source licenses may have restrictions that conflict with the four freedoms.
Examples of Operating System Models:
- GNU/Linux: The most famous example. It is an open-source operating system, and many of its distributions (like Ubuntu) are free software. However, some distributions may include proprietary components.
- Microsoft Windows: A classic example of closed-source or proprietary software. Microsoft owns the code, restricts its use, and keeps the source code secret.
- Apple macOS: A hybrid approach. Its core, named Darwin, is open-source. However, the user interface and many key components are proprietary and closed-source.
Why Source Code Matters for Learning¶
Having the source code is a powerful learning tool for several reasons:
- Transparency: You can see exactly how the system works, from low-level scheduling to high-level system calls.
- Modification and Experimentation: A student can modify the source code, recompile it, and run the modified OS to see the effects of their changes. This is an excellent way to understand complex algorithms.
- No Reverse Engineering Needed: Reverse engineering binary code to understand functionality is extremely difficult and time-consuming. Source code provides the complete picture, including programmer comments.
This textbook will use high-level descriptions of algorithms but will also include pointers to open-source code for deeper study and projects that involve modifying OS source code.
Benefits of the Open-Source Model¶
The open-source development model offers significant advantages:
- Community Development: A global community of interested programmers can contribute by writing, debugging, analyzing, and improving the code. Many of these contributors are volunteers.
- Security ("Linus's Law"): The principle that "given enough eyeballs, all bugs are shallow" suggests that open-source code can be more secure because more people are examining it for vulnerabilities. While bugs exist, they are often found and fixed more quickly than in closed-source systems.
- Commercial Viability: Companies like Red Hat have built successful business models around open-source software by selling support, customization, and integration services, rather than just the software license itself.
1.11.1 History¶
The relationship between software and its source code has evolved significantly.
1950s-1970s: The Era of Sharing In the early days, software was commonly distributed with its source code. Computer enthusiasts and user groups freely shared and modified code. For example, Digital Equipment Corporation (DEC) distributed its operating systems as source code without restrictive copyrights.
1980s: The Shift to Proprietary Software As the software industry grew, companies began to see software as a primary product. To protect their intellectual property and generate revenue, they started distributing only the compiled binary files, keeping the source code secret. This created proprietary software. By the 1980s, this closed-source model had become the norm, even for operating systems on hobbyist computers.
1.11.2 Free Operating Systems¶
In response to the growing trend of proprietary software, Richard Stallman launched a movement in 1984 to create a free, UNIX-compatible operating system called GNU (a recursive acronym for "GNU's Not Unix!").
The Philosophy of "Free Software": For Stallman, "free" refers to freedom, not price. The movement does not oppose selling software, but insists that users must have four essential freedoms:
- The freedom to run the program for any purpose.
- The freedom to study how the program works and adapt it to your needs (which requires access to the source code).
- The freedom to redistribute copies so you can help your neighbor.
- The freedom to improve the program and release your improvements to the public, so that the whole community benefits.
In 1985, Stallman published the GNU Manifesto, outlining this philosophy, and founded the Free Software Foundation (FSF) to promote the development and use of free software.
Copyleft and the GNU General Public License (GPL): To legally protect these freedoms, the FSF uses a concept called copyleft, which uses copyright law to achieve the opposite of its usual purpose. Instead of restricting use, copyleft ensures the software remains free.
- The GNU General Public License (GPL) is the most well-known copyleft license.
- It grants the four freedoms but with a crucial condition: if you redistribute the program, or a modified version of it, you must do so under the same GPL license. This "share-alike" clause prevents anyone from taking free code, modifying it, and turning it into a proprietary product. The source code must always be available.
- This concept is similar to the Creative Commons "Attribution-ShareAlike" license.
1.11.3 GNU/Linux¶
GNU/Linux is the prime example of a successful free and open-source operating system. Its development is a story of two projects merging.
The Genesis: GNU and the Linux Kernel
- By 1991, the GNU Project had developed almost all the components of a complete operating system (compilers, editors, utilities, libraries) except for one critical part: a working kernel (the core of the OS that manages hardware).
- In 1991, Linus Torvalds, a Finnish student, wrote a rudimentary UNIX-like kernel and released it on the internet. He used the GNU development tools to build it.
- Leveraging the internet, thousands of programmers worldwide began contributing to Torvalds' kernel, which became known as Linux.
- Initially, Linux had a non-commercial license. In 1992, Torvalds re-released it under the GPL, making it free software and allowing it to be combined with the GNU system.
The Result: Distributions The combination of the Linux kernel and the GNU utilities created the complete GNU/Linux operating system. This has led to the creation of hundreds of different distributions (or "distros"), which are custom-built versions of the system. They vary in their target audience, pre-installed software, user interface, and support. Major distributions include:
- Red Hat Enterprise Linux: For commercial, enterprise use.
- Ubuntu: A popular, user-friendly distribution for desktops and servers.
- Debian: A community-driven distribution known for stability.
- Specialized Distros: Some are designed for specific purposes. For example, PCLinuxOS is a live CD/DVD—an OS that can be booted directly from a disc or USB drive without installing it on the computer's hard drive. A variant like PCLinuxOS Supergamer DVD comes pre-loaded with games and drivers.
How to Run Linux for Study¶
The text recommends an easy way to run Linux alongside your current operating system using virtualization:
- Download a Virtual Machine Monitor (VMM): Install a free tool like VirtualBox from https://www.virtualbox.org/. This software allows you to run an entire operating system as a "guest" within a window on your "host" OS.
- Get a Linux Image: You can either:
- Install an OS from scratch using an installation CD image.
- Download a pre-built virtual machine image from a site like http://virtualboxes.org/images/, which comes with an OS and applications already installed.
- Boot the Virtual Machine: Start the virtual machine within VirtualBox, and you will have a full Linux system running on your computer.
An alternative to VirtualBox is QEMU, which includes tools for converting VirtualBox images.
This textbook provides a virtual machine image of GNU/Linux running Ubuntu. This image contains the Linux source code and development tools. We will use this environment for examples and a detailed case study in Chapter 20.
1.11.4 BSD UNIX¶
BSD UNIX has a longer and more complex history than Linux. It originated in 1978 as a set of modifications and enhancements to AT&T's original UNIX operating system, developed at the University of California at Berkeley (UCB).
Key Historical Points:
- Not Initially Open Source: Early BSD releases included source code, but they were not considered "open source" in the modern sense because they required a license from AT&T, the original owner of UNIX.
- Legal Hurdles: The development of BSD was significantly delayed by a lawsuit from AT&T over intellectual property. This legal battle was a major catalyst for the creation of completely free, AT&T-independent UNIX-like systems.
- Resolution: The lawsuit was eventually settled, leading to the release of a fully functional, truly open-source version called 4.4BSD-lite in 1994. This release is the foundational ancestor of modern BSD systems.
Modern BSD Distributions: Similar to Linux, there are several distributions (or "flavors") of BSD, each with a slightly different focus:
- FreeBSD: Focuses on performance and ease of use on standard PC hardware.
- NetBSD: Emphasizes portability, running on a vast array of hardware platforms.
- OpenBSD: Prioritizes security and code correctness.
- DragonFly BSD: Explores novel approaches to multiprocessing.
How to Study BSD Source Code: The process is very similar to studying Linux:
- Download a Virtual Machine Image: You can download a pre-configured FreeBSD virtual machine image and run it using a virtual machine manager like VirtualBox (as described previously for Linux).
- Locate the Source Code: The entire operating system source code is included with the distribution and is stored in the directory /usr/src/.
- Find the Kernel Code: The kernel source code is located in /usr/src/sys/. For example, to study the virtual memory implementation, you would examine the files in the /usr/src/sys/vm/ directory.
- Online Browsing: Alternatively, you can browse the source code online via the FreeBSD project's website: https://svnweb.freebsd.org.
Version Control Systems: The BSD project, like most large open-source projects, uses a Version Control System (VCS) to manage changes to the source code. BSD uses Subversion (SVN).
- Purpose of a VCS: These systems allow developers to "pull" the latest code to their computer, make changes, and "push" those changes back to a central repository. They also keep a complete history of every file, manage contributions from multiple developers, and help resolve conflicts.
- Other VCS: Another extremely popular version control system is git, which is used to manage the Linux kernel source code and many other projects.
macOS and Darwin: The core of Apple's macOS, called Darwin, is based on BSD UNIX. Darwin itself is open-source. Its source code is available from http://www.opensource.apple.com/, with each macOS release having its corresponding open-source components posted. The macOS kernel package begins with "xnu". Apple also provides extensive developer resources at http://developer.apple.com.
THE STUDY OF OPERATING SYSTEMS¶
We are in a golden age for studying operating systems. The barriers to entry have never been lower, thanks to two major developments:
1. The Open-Source Movement:
- Access to Code: Major operating systems like Linux, BSD UNIX, Solaris, and parts of macOS are available in both source and binary form. This allows us to move beyond just reading descriptions and to see how things actually work by examining the code itself.
- Historical Systems: Even older, commercially obsolete operating systems have been open-sourced, allowing students to study the design constraints and solutions from eras with limited CPU, memory, and storage. A large list of open-source OS projects is available online.
2. The Rise of Virtualization:
- Easy Experimentation: Free and widely available virtualization software like VMware Player and VirtualBox allows you to run hundreds of different operating systems as "virtual appliances" on a single physical machine. You can test, experiment, and even break an OS inside a virtual machine without affecting your main system or needing dedicated hardware.
- Hardware Simulation: For truly historical study, simulators exist for old hardware (like the DECSYSTEM-20). This allows you to run an original operating system like TOPS-20, complete with its original source code, on a modern machine.
From Student to Developer: This open environment makes the transition from student to contributor or even creator possible. With dedication and an internet connection, a student can download source code, modify it, and create their own operating system distribution. Access to knowledge and tools is now limited only by a student's interest and effort, not by proprietary restrictions.
1.11.5 Solaris¶
Solaris is the UNIX-based operating system developed by Sun Microsystems (now owned by Oracle). Its history reflects the evolution of the UNIX family:
- Origins: Sun's original operating system, SunOS, was based on BSD UNIX.
- Transition: In 1991, Sun shifted its base from BSD to AT&T's System V UNIX, which led to the renaming of the operating system to Solaris.
- Open-Sourcing: In 2005, Sun open-sourced most of the Solaris code under the name OpenSolaris. This move was significant as it provided access to the source code of a mature, enterprise-level commercial UNIX system.
- Oracle Acquisition: Oracle's purchase of Sun in 2009 created uncertainty about the future of the OpenSolaris project. Oracle ultimately discontinued the open-source model for the main Solaris product.
The Illumos Project: The community that had grown around OpenSolaris continued its development independently. This effort coalesced into Project Illumos.
- Purpose: Illumos is a community-driven, open-source fork of the OpenSolaris codebase. It has expanded beyond the original OpenSolaris base to include new features and improvements.
- Role: Illumos now serves as the core for several derivative operating system distributions, keeping the OpenSolaris lineage alive. You can find more information at http://wiki.illumos.org.
1.11.6 Open-Source Systems as Learning Tools¶
The open-source movement has created an unprecedented opportunity for students to learn about operating systems deeply and practically.
Direct Hands-On Learning:
- Examination and Modification: Students can read the source code of mature, full-featured operating systems to understand how algorithms are implemented in real-world scenarios. They can then modify this code, compile it, and test their changes to see the direct effects.
- Community Participation: Students can contribute to real projects by helping to find and fix bugs (debugging), which is an invaluable skill. This provides practical experience far beyond theoretical study.
Access to Historical Context: The availability of source code for historic systems, like Multics, allows students to understand the design decisions and constraints of earlier eras of computing. This historical knowledge provides a stronger foundation for understanding modern systems and implementing new projects.
Diversity of Systems: A major advantage is the ability to compare different systems. For example:
- GNU/Linux and BSD UNIX are both open-source, but they have different histories, goals, licensing terms, and design philosophies.
- This diversity allows students to see multiple solutions to the same fundamental problems (e.g., process scheduling, memory management).
Cross-Pollination and Innovation: Open-source licenses often allow code to be shared between projects. This leads to cross-pollination, where the best features from one system are incorporated into another. For example, major components from OpenSolaris, such as its advanced filesystem (ZFS) and debugging tools (DTrace), have been ported to BSD-based systems and Linux. This sharing accelerates innovation and improvement across all open-source projects.
The benefits of open-source software are likely to continue driving an increase in the number, quality, and adoption of these projects by both individuals and companies.
1.12 Summary¶
This chapter introduced the fundamental concepts of operating systems. Here is a summary of the key points:
Operating System Definition: An operating system is the software that acts as an intermediary between the computer hardware and application programs. It manages the hardware and provides an environment for programs to run.
Interrupts: These are crucial signals from hardware devices to the CPU, alerting it that an event requires attention (e.g., a key is pressed, a disk read is complete). The operating system uses an interrupt handler to manage these events.
Main Memory (RAM): This is the primary, volatile storage that the CPU can access directly. Programs must be loaded into main memory to be executed. Its contents are lost when power is turned off.
Storage Hierarchy: Computer storage is organized in a hierarchy based on speed and cost.
- Top (Fast/Expensive): CPU registers, cache.
- Middle: Main Memory (RAM).
- Bottom (Slow/Inexpensive): Nonvolatile storage like hard disks, which provide permanent, high-capacity storage for programs and data.
Multiprocessor Systems: Modern computers contain multiple processors (CPUs), and each CPU often contains multiple computing cores, allowing true parallel execution.
CPU Management:
- Multiprogramming: This technique keeps several jobs (processes) in memory at once. If one job waits for I/O, the CPU can switch to another job, ensuring the CPU is always busy.
- Multitasking (Time-Sharing): An extension of multiprogramming that rapidly switches the CPU between processes, providing users with a fast, interactive response time.
Dual-Mode Operation: To protect the system, hardware supports two modes:
- User Mode: Where user applications run. Access to certain instructions and memory is restricted.
- Kernel Mode: Where the operating system runs. It has unrestricted access to all hardware instructions, including privileged instructions for I/O control, timer management, and interrupt handling.
Process Management: A process is the fundamental unit of work. The OS is responsible for creating, deleting, and managing processes, including enabling them to communicate and synchronize with each other.
Memory Management: The OS keeps track of memory usage, allocates memory to processes when they need it, and frees it when they are done.
Storage Management: The OS manages disk space and provides a file system—a way to store, organize, and retrieve files and directories on storage devices.
Protection and Security: The OS provides mechanisms to control access to resources (protection) and to defend the system from external and internal threats (security).
Virtualization: This technology involves abstracting physical hardware to create multiple, isolated execution environments (virtual machines) on a single physical machine.
Data Structures: Operating systems rely on fundamental data structures like lists, stacks, queues, trees, and maps to manage information efficiently.
Computing Environments: Operating systems are used in various settings:
- Traditional Computing (evolving office and home PCs)
- Mobile Computing (smartphones, tablets)
- Client-Server Computing
- Peer-to-Peer (P2P) Computing
- Cloud Computing (IaaS, PaaS, SaaS)
- Real-Time Embedded Systems
Free and Open-Source Operating Systems:
- These systems provide their source code, which is a powerful learning tool.
- Free Software emphasizes user freedoms (use, study, modify, redistribute).
- Open-Source Software focuses on the practical benefits of collaborative development.
- Examples include GNU/Linux, FreeBSD, and OpenSolaris.
Chapter 2: Operating-System Structures¶
Introduction to Operating System Services¶
An operating system (OS) is the fundamental software that provides an environment for programs to run. Think of it as the manager of your computer's hardware, making it easier and safer for you and your programs to use the CPU, memory, disks, and other devices.
Since operating systems can be built in many different ways, having clear design goals is crucial before starting to build one. These goals guide the choice of algorithms and strategies used inside the OS.
We can look at an operating system from three main angles:
- Services it provides: What does it do for users and programs?
- Interface it offers: How do users and programmers interact with it?
- Internal components: What are its pieces and how do they connect?
This chapter will explore all three of these viewpoints.
2.1 Operating-System Services¶
Introduction: The OS as a Service Provider¶
An operating system's core job is to provide a stable and functional environment where programs can run. It achieves this by offering a set of essential services to both the programs themselves and the users who run them. While the exact list of services can vary between different operating systems (like Windows, Linux, or macOS), we can group them into common categories.
Go to Figure 2.1: This figure gives a visual overview of these services. It shows how user requests, made via a GUI or command line, are translated into system calls that interact with the operating system. The OS then uses the hardware to provide these services. The services can be broadly divided into those that help the user and those that ensure the system itself runs efficiently.
Part 1: Services for the User¶
These are the functions that you, as a user or programmer, interact with directly or indirectly to get your work done.
1. User Interface (UI)¶
This is how you interact with the OS. There are three primary types:
- Graphical User Interface (GUI): The modern, visual interface most people are familiar with. It uses windows, icons, menus, and a pointer (like a mouse). You click to direct input/output and make selections.
- Touch-Screen Interface: A variant of the GUI common on mobile devices. You use gestures like sliding and tapping on the screen to interact.
- Command-Line Interface (CLI): A text-based interface where you type specific commands with options. It's powerful for automation and precise control.
Many systems offer a choice between these interfaces.
2. Program Execution¶
The OS must be able to load a program's instructions into main memory and start its execution. This involves the CPU fetching and executing the instructions. Just as importantly, the OS must be able to clean up after a program ends, whether it finishes normally or crashes abnormally (e.g., due to an error).
3. I/O Operations¶
Running programs almost always need to perform Input/Output operations, like reading from a file or printing to a printer. For efficiency and, crucially, for protection, user programs are not allowed to directly control I/O devices. Imagine if any program could directly read from your disk; malware would be trivial to write! Therefore, the OS provides a safe, controlled means to perform all I/O.
4. File-System Manipulation¶
Programs need to work with data stored long-term. The OS manages the file system, allowing programs to:
- Create and delete files and directories.
- Search for files.
- Read from and write to files.
- List file information (metadata).
- Manage permissions to control which users can access which files.
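The operations above can be sketched in a few lines of Python (an illustration chosen for brevity, not taken from the text); each library call is a thin wrapper that ultimately asks the OS, via system calls, to do the real work:

```python
import os
import stat
import tempfile

# Work in a scratch directory so the example is self-contained.
base = tempfile.mkdtemp()

# Create a directory and a file inside it.
os.mkdir(os.path.join(base, "docs"))
path = os.path.join(base, "docs", "notes.txt")
with open(path, "w") as f:          # create + write
    f.write("hello, file system\n")

# Read the file back.
with open(path) as f:
    text = f.read()

# List directory contents and file information (metadata).
entries = os.listdir(os.path.join(base, "docs"))
info = os.stat(path)                # size, timestamps, permissions

# Manage permissions: make the file read-only for the owner.
os.chmod(path, stat.S_IRUSR)

# Delete the file and directory.
os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)  # restore write access first
os.remove(path)
os.rmdir(os.path.join(base, "docs"))

print(text.strip())                 # hello, file system
print(entries)                      # ['notes.txt']
```

Each of these calls (open, stat, chmod, unlink, and so on) corresponds closely to a system call that the OS performs on the program's behalf.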
5. Communications¶
Often, one process needs to talk to another. This could be between two processes on the same computer or between processes on different computers connected by a network. The OS provides mechanisms for this communication, primarily through two methods:
- Shared Memory: A block of memory is created that multiple processes can both read from and write to.
- Message Passing: Processes send and receive discrete packets of information (messages) through the OS, which handles the transfer.
6. Error Detection¶
The OS must be constantly vigilant for errors. These can occur in:
- Hardware: Memory errors, power failure.
- I/O Devices: Network connection failure, printer out of paper.
- User Program: Arithmetic overflow, attempting to access an illegal memory address.
The OS must detect these errors and take appropriate action, which could be terminating the offending program, retrying the operation, or, in severe cases, halting the entire system to prevent corruption.
Part 2: Services for Ensuring Efficient System Operation¶
These functions operate behind the scenes. The user doesn't directly invoke them, but they are critical for the stability, security, and performance of the entire system, especially when multiple processes are running.
1. Resource Allocation¶
When multiple programs or jobs are running concurrently, they must share the finite resources of the computer. The OS acts as a resource manager, allocating resources like:
- CPU Time: Using CPU-scheduling algorithms to decide which process runs next.
- Main Memory: Deciding how much memory to give to each process.
- File Storage: Managing disk space.
- I/O Devices: Allocating devices like printers and USB drives.
2. Logging (Accounting)¶
The OS keeps track of which users and programs use how many and what types of resources. This data can be used for:
- Billing: In large, shared systems where users are charged for their usage.
- Usage Statistics: System administrators use this data to analyze performance, spot trends, and reconfigure the system for better efficiency.
3. Protection and Security¶
In a multi-user or networked environment, it is vital to control access to information and prevent processes from interfering with each other or the OS itself.
- Protection: This deals with internal control. It ensures that each process can only access the resources (memory, files, CPU) that it is authorized to use. This prevents a buggy or malicious program from crashing the entire system.
- Security: This defends against external threats. It starts with authentication (like requiring a password to log in) and extends to securing network connections and defending against unauthorized access attempts from outside the system. The text makes a key point: security must be comprehensive. A single weak point, like a poor password or an unpatched network service, can compromise the entire system, just as a chain is only as strong as its weakest link.
2.2 User and Operating-System Interface¶
An operating system provides ways for users to tell it what to do. We will now explore the two main categories of user interfaces in more detail: the Command-Line Interface and the Graphical User Interface.
2.2.1 Command Interpreters (Shells)¶
Think of the command interpreter as a bridge between you and the operating system's core. It's a special program that reads and carries out your commands.
- On systems like Linux, UNIX, and Windows, this program starts up as soon as you log in.
- When you have a choice between different command interpreters, they are often called shells. For example, on Linux and UNIX, you can choose from shells like the C shell, Bourne-Again shell (bash), and Korn shell. They all do the same basic job, so the choice is about personal preference for features and syntax.
Go to Figure 2.2: This figure shows the Bourne-Again shell (bash) in action on a macOS terminal window. You type a command at the prompt, and the shell executes it.
The Shell's Main Job¶
The primary function of a shell is simple:
- Get the next command from the user.
- Execute that command.
Many of these commands are for file manipulation (create, delete, list, copy, etc.). How the shell actually executes these commands can be implemented in one of two ways:
Two Ways to Implement Commands¶
1. The Built-in Command Approach
- In this method, the command interpreter (the shell program itself) contains the actual code to perform the command.
- For example, when you type a delete command, the shell has a specific section of its own code that it runs to handle file deletion. This code then makes the necessary system call to the OS.
- Disadvantage: The shell program's size is directly tied to the number of commands it supports. Adding a new command requires modifying and re-releasing the shell itself.
2. The System Program Approach (Used by UNIX/Linux)
- This is a more modular and powerful approach. Here, the shell itself does not contain the code for commands like ls or rm.
- Instead, the shell's job is to find a program file with the same name as the command and execute it.
- Example: When you type rm file.txt, the shell:
  - Searches for a file named rm in a set of predefined directories (known as the PATH).
  - Loads this rm file into memory.
  - Executes it, passing the argument file.txt to it.
- The actual logic for deleting the file is completely defined within the separate rm program.
- Advantage: This system is highly extensible. To add a new command, you just need to create a new program and place it in the right directory. The shell can remain small and simple, and it doesn't need to be changed to support new commands.
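The system-program approach can be sketched as a toy "shell" (in Python, purely for illustration; a real shell is far more involved): it contains no command logic of its own, only the search-and-execute loop.

```python
import shutil
import subprocess

def run_command(line):
    """Locate a program on the PATH and run it with its arguments."""
    parts = line.split()
    if not parts:
        return None
    # Step 1: search the PATH directories for a program with this name.
    program = shutil.which(parts[0])
    if program is None:
        print(parts[0] + ": command not found")
        return None
    # Steps 2-3: load and execute the program, passing along the arguments.
    return subprocess.run([program] + parts[1:]).returncode

# Example: this runs the real 'echo' program found on the PATH.
status = run_command("echo hello from the toy shell")
```

Adding a "new command" to this shell requires no change to the shell itself; placing a new executable in a PATH directory is enough, which is exactly the extensibility advantage described above.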
2.2.2 Graphical User Interface (GUI)¶
The second primary strategy for interacting with an operating system is through a Graphical User Interface (GUI). This is a visual, user-friendly alternative to the text-based command line.
Core Concepts of a GUI¶
Instead of typing commands, users interact with a mouse-based window-and-menu system built around a desktop metaphor. Think of your screen as a physical desk. On it, you see:
- Icons: Small images that represent programs, files, directories (called folders in a GUI), and system functions.
- Windows: Rectangular areas that display the contents of an application or a folder.
- Menus: Lists of commands that appear when you click on a title or button.
You control a pointer on the screen using a mouse (or touchpad). By moving the pointer and clicking the mouse buttons, you can:
- Invoke a program by double-clicking its icon.
- Select a file or folder by clicking on it.
- Pull down a menu to see and select from a list of commands.
Historical Context¶
The GUI has a rich history:
- It was pioneered in the early 1970s at the Xerox PARC research facility.
- The first computer to feature a GUI was the Xerox Alto in 1973.
- GUIs became mainstream in the 1980s with the Apple Macintosh.
- Microsoft introduced its GUI as Windows, initially as an addition to the MS-DOS operating system.
GUIs in the UNIX/Linux World¶
While UNIX systems were traditionally command-line dominated, they have powerful GUI options available, largely developed by the open-source community. Two major examples are:
- KDE (K Desktop Environment)
- GNOME (GNU Network Object Model Environment)
These desktop environments run on Linux and other UNIX-like systems and are available under open-source licenses, meaning their source code can be freely used, modified, and distributed.
2.2.3 Touch-Screen Interface¶
For mobile devices like smartphones and tablets, using a mouse or typing long commands is impractical. These devices primarily use a Touch-Screen Interface.
How it Works¶
Users interact directly with the screen using their fingers. Interaction is based on gestures, such as:
- Pressing (Tapping): To select an item, like an app icon or a button.
- Swiping: To scroll through a page, switch between screens, or reveal menus.
- Pinching and Spreading: To zoom in and out.
While early smartphones had physical keyboards, most modern devices simulate a keyboard directly on the touch screen when text input is needed.
Go to Figure 2.3: This figure shows the touch-screen interface of an Apple iPhone, illustrating how apps and functions are controlled via direct touch interaction.
2.2.4 Choice of Interface¶
The decision to use a command-line interface (CLI) or a graphical user interface (GUI) is largely a matter of personal preference and the specific task at hand. Different interfaces offer different advantages for different types of users.
Who Uses the Command-Line Interface (CLI) and Why?¶
The primary users of the command line are system administrators and power users. For them, the CLI is often more efficient because it provides:
- Faster Access: It can be quicker to type a command than to navigate through multiple menus with a mouse.
- Access to More Functions: On some systems, advanced or less common system functions are only available via the command line.
- Powerful Automation: This is a key advantage. The CLI is highly programmable. Repetitive tasks involving multiple commands can be saved in a text file called a shell script. This script can then be run as a program. The command-line interpreter reads and executes the commands in the script one by one. This use of scripts is extremely common on UNIX and Linux systems.
Who Uses the Graphical User Interface (GUI) and Why?¶
The vast majority of users, especially on Windows and macOS, primarily use the GUI. It is more intuitive and visually oriented.
- Windows: Most users rely entirely on the Windows GUI. Modern Windows systems provide a standard GUI for desktops and a touch-screen variant for tablets.
- macOS: The history of Macintosh offers an interesting case study. Historically, Mac OS only provided a GUI. However, with the modern macOS (which is built on a UNIX kernel), the system now provides both the Aqua GUI and a full, powerful command-line interface (the Terminal app), giving users the best of both worlds.
Go to Figure 2.4: This figure shows the macOS GUI, known as Aqua, with its characteristic dock and menu bar.
Mobile Systems and Their Interface¶
On mobile systems like iOS and Android, the touch-screen interface is dominant. While command-line apps exist, they are used very rarely. Almost all user interaction is done through gestures on the touch screen.
A Key Distinction: The User Interface vs. The Operating System Core¶
It is crucial to understand that the user interface (whether CLI, GUI, or touch) is separate from the core structure of the operating system. The design of a user-friendly interface is a distinct challenge from solving the fundamental problems of operating system design, such as process scheduling or memory management.
From the operating system's perspective, a user clicking an icon in a GUI and a user running a script from the command line are ultimately doing the same thing: requesting a service. The OS does not fundamentally distinguish between a "user program" and a "system program"; it provides services to all programs that request them properly. This book focuses on these core services and how the OS provides them.
2.3 System Calls¶
Introduction: The Programmer's Gateway to the OS¶
System calls are the fundamental interface between a running application program and the operating system. They are the mechanism that a program uses to request a service from the kernel, such as reading a file or creating a process.
These calls are typically implemented as functions written in high-level languages like C and C++, which makes them accessible to application programmers. However, for tasks that require the absolute highest performance or direct hardware manipulation, they may need to be implemented using assembly language instructions.
2.3.1 Example: A File Copy Program¶
To understand how system calls are used, let's trace through a simple program that reads data from one file and copies it to another. This example will show that even a simple task requires a sequence of many system calls.
Go to Figure 2.5: This flowchart provides an excellent visual guide for the sequence of system calls described below.
Step 1: Getting the File Names¶
The program first needs to know which file to read from (in.txt) and which to write to (out.txt). There are two common ways to do this, both involving system calls:
- Command-Line Arguments: The file names are passed as arguments when the program is run (e.g., `cp in.txt out.txt`). The shell has already handled the system calls to parse this command.
- Interactive Input: The program interactively asks the user for the names. This requires:
  - A system call to write a prompting message to the screen.
  - A system call to read the user's input from the keyboard.
  - In a GUI, this involves many more I/O system calls to display windows, handle mouse clicks, etc.
Step 2: Setting Up the Files¶
Once the program has the file names, it must prepare the files for operation:
- Open the input file: A system call to `open("in.txt")`.
  - Error Handling: This call can fail. The OS returns an error if the file doesn't exist or if the program doesn't have permission to read it. The program must then make system calls to write an error message and terminate abnormally.
- Create and open the output file: A system call to `creat("out.txt")` or `open()` with specific flags.
  - Error Handling: If the output file already exists, the program must decide what to do. It might:
    - Abort the operation (a system call).
    - Delete the existing file (a system call) and then create a new one (another system call).
    - Ask the user what to do (using more write and read system calls).
Step 3: The Copy Loop¶
With both files successfully opened, the program enters the main copy loop:
- Read from input: A system call to `read()` a chunk of data from the input file into memory.
  - This call returns status information. It will eventually indicate when the end of the file is reached. It could also report a hardware error.
- Write to output: A system call to `write()` that same chunk of data from memory to the output file.
  - This call can also fail, for example, if the disk runs out of space.
This read-write cycle repeats until the entire file has been copied.
Step 4: Cleanup and Termination¶
After the loop finishes, the program must clean up and exit cleanly:
- Close both files: System calls to `close()` the input and output files. This tells the OS that the program is done with them, ensuring all data is physically written to disk.
- Notify the user: A system call to write a "copy complete" message to the screen.
- Exit: A final system call to terminate normally.
As you can see, a seemingly simple command like cp is built upon a long and carefully managed sequence of system calls, each one a request for the operating system to perform a specific, protected action.
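The four steps above can be sketched as a short C function built on the POSIX calls `open()`, `read()`, `write()`, and `close()`. This is a minimal illustration rather than a faithful `cp`: every failure is collapsed into a single `-1` return, and the function name and 4 KB buffer size are arbitrary choices for the sketch.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Copy src to dst using raw POSIX system calls.
   Returns 0 on success, -1 on any failure. */
int copy_file(const char *src, const char *dst) {
    /* Step 2: open the input file, create/truncate the output file. */
    int in_fd = open(src, O_RDONLY);
    if (in_fd < 0)
        return -1;                      /* e.g., file missing or no permission */
    int out_fd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out_fd < 0) {
        close(in_fd);
        return -1;
    }

    /* Step 3: the copy loop -- read a chunk, write it back out. */
    char buf[4096];
    ssize_t n;
    while ((n = read(in_fd, buf, sizeof buf)) > 0) {
        if (write(out_fd, buf, (size_t)n) != n) {  /* e.g., disk full */
            close(in_fd);
            close(out_fd);
            return -1;
        }
    }
    /* read() returns 0 at end-of-file and -1 on a hardware/OS error. */

    /* Step 4: cleanup -- close both files. */
    close(in_fd);
    close(out_fd);
    return (n == 0) ? 0 : -1;
}
```

A real utility would also report *why* it failed (using `errno`) and decide what to do when the output file already exists, as discussed in Step 2.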
2.3.2 Application Programming Interface (API)¶
Bridging the Gap for Programmers¶
As we saw with the file copy example, programs rely heavily on the OS. However, most application programmers do not work with raw system calls directly. Instead, they use an Application Programming Interface (API).
An API is a set of well-defined functions, including their parameters and return values, that a programmer can use to build applications. Think of it as a standard contract between the programmer and the operating system.
Common APIs include:
- Windows API: For the Windows operating system.
- POSIX API: For POSIX-based systems like UNIX, Linux, and macOS.
- Java API: For programs that run on the Java Virtual Machine (JVM).
To use these functions, a programmer includes a library provided by the OS, such as libc for C programs on UNIX/Linux.
The Relationship Between an API and System Calls¶
The functions in an API are typically wrappers that invoke the actual system calls.
- Example: A Windows programmer calls the API function `CreateProcess()`. Behind the scenes, this function invokes the actual system call in the Windows kernel, which might be called `NTCreateProcess()`.
Why use an API instead of direct system calls?
- Portability: Code written using a standard API (like POSIX) is more likely to compile and run on different systems that support the same API. Direct system calls are often unique to a specific OS.
- Ease of Use: APIs are generally simpler and more convenient for programmers than the often more complex and detailed raw system calls.
Example: The read() API in UNIX/Linux¶
Let's examine a real-world API. The command `man read` on a UNIX system reveals the API for the `read()` function:
#include <unistd.h>
ssize_t read(int fd, void *buf, size_t count);
- Return Value: `ssize_t`: the number of bytes read. `0` means end-of-file, `-1` means an error occurred.
- Parameters:
  - `int fd`: The file descriptor (a number representing the open file).
  - `void *buf`: A pointer to the memory buffer where the read data should be stored.
  - `size_t count`: The maximum number of bytes to read.
This is the clean, standard interface a programmer works with.
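A small helper makes the three return values of `read()` concrete: a positive byte count while data remains, `0` at end-of-file, and `-1` on error. The function name and the deliberately tiny buffer are choices made for this sketch, not part of the API.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Read a file in small chunks and return its total size in bytes,
   or -1 on error -- exercising all three read() return values. */
long count_bytes(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;                 /* open failed: no such file, no permission */

    long total = 0;
    char buf[8];                   /* tiny buffer to force several read()s */
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        total += n;                /* n > 0: number of bytes actually read */

    close(fd);
    return (n == 0) ? total : -1;  /* 0 = end-of-file, -1 = read error */
}
```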
The System-Call Interface: The Real Gateway¶
The run-time environment (RTE), which includes libraries and loaders, provides a crucial layer called the system-call interface.
How it works (Refer to Figure 2.6):
- A user application calls an API function like `open()`.
- The system-call interface intercepts this call.
- Each system call has a unique number. The interface uses this number to look up the address of the corresponding kernel function in a table.
- The interface then triggers a switch from user mode to kernel mode and invokes the actual `open()` system call code within the operating system.
- After the system call completes, the result is returned to the application.
The programmer doesn't need to know any of these implementation details; they only need to use the API correctly.
Passing Parameters to System Calls¶
System calls often require parameters (e.g., which file to open). Since a system call involves a privileged switch to kernel mode, these parameters must be passed safely. There are three common methods:
- Pass in Registers: The simplest method. Parameters are placed in the CPU's registers.
- Pass in a Block (Table): If there are more parameters than available registers, they are stored in a block of memory. The address of this block is then passed in a single register.
- Linux Example: Linux uses registers if there are 5 or fewer parameters; if there are more, it uses the block method.
- Pass on the Stack: Parameters are pushed onto the program's stack. The operating system then pops them off the stack. Some operating systems prefer the block or stack method because those approaches do not limit the number or length of parameters being passed.
Go to Figure 2.7: This figure illustrates the block method. The user program places parameters in a table in memory (X) and then passes the address of that table in a register. The operating system then knows to look in that memory block to retrieve all the parameters for the system call.
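On Linux, the register method described above can be made visible with the C library's `syscall()` wrapper: you supply the system-call number and the parameters, and the wrapper loads them into the registers the kernel expects before trapping into kernel mode. This sketch assumes a Linux system with glibc; the wrapper function name `raw_write` is made up for the example.

```c
#define _GNU_SOURCE        /* expose syscall() in <unistd.h> on glibc */
#include <assert.h>
#include <sys/syscall.h>   /* SYS_write and other system-call numbers */
#include <unistd.h>

/* Invoke write() by its system-call number rather than through the
   libc wrapper. syscall() places the call number and the three
   parameters into registers -- the "pass in registers" method. */
long raw_write(int fd, const void *buf, size_t count) {
    return syscall(SYS_write, fd, buf, count);
}
```

The numbers behind names like `SYS_write` are exactly the unique identifiers that the system-call interface uses to index its table of kernel functions.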
2.3.3 Types of System Calls¶
Think of system calls as the official "request lines" that your application programs use to ask the operating system kernel for services. Since the kernel has supreme control over hardware and critical resources, user programs can't do these tasks directly. They must make a system call.
System calls are generally grouped into six major categories. Figure 2.8 provides a concise summary of these categories and their common calls. We will now explore each one in detail.
2.3.3.1 Process Control¶
This category deals with the management of processes—which are basically programs in execution. If you remember from computer architecture, a process is more than just the code; it's the code, data, and the state of the CPU (registers, program counter, etc.).
Key System Calls and Their Purposes:
- `end()` and `abort()`: A process needs ways to stop. `end()` is for a normal, happy ending. `abort()` is for when something goes wrong. When this happens, the OS might create a core dump (a snapshot of the process's memory at the time of failure) and save it to a log file. A programmer can later use a debugger to analyze this dump and find the bug.

Control Transfer: After a process ends (normally or abnormally), the OS must give control back to the parent process, which is often the command interpreter (the shell). In a GUI system, this might mean showing a pop-up error message.

Error Levels: A program terminating abnormally can specify an error level (a numeric code) to indicate the severity or type of error. A normal termination is often defined as error level 0. The shell or a subsequent program can automatically check this error level to decide its next action (e.g., retry the operation, send a notification, or continue).

- `load()` and `execute()`: This is how one program starts another. For example, when you type a command like `ls` in your shell, the shell uses these calls to load and run the `ls` program. A critical question here is: what happens to the original program?
  - If the original program waits for the new one to finish, we need to save its state.
  - If they both run at the same time, we have created a new process. This is so common that there is often a dedicated `create_process()` or `fork()` system call.

Process Attributes: Once you have processes, you need to manage them.

- `get_process_attributes()` and `set_process_attributes()` allow you to check and change things like a process's priority or its maximum allowed running time.
- `terminate_process()` lets you kill a process you created.

Synchronization (Waiting and Signaling): Processes often need to coordinate.

- A parent process might call `wait_event()` to wait for a child process to finish.
- The child process then calls `signal_event()` when it is done.
- `wait_time()` puts a process to sleep for a specified period.

Locking for Shared Data: When multiple processes share data (e.g., a file), we need to prevent them from stepping on each other's toes. The `acquire_lock()` and `release_lock()` system calls create a critical section, ensuring that only one process can access the shared data at a time.
Real-World Examples: From Simple to Complex¶
To fully grasp the variations in process control, it's helpful to compare two extremes: a simple single-tasking system and a complex multitasking one.
1. Single-Tasking System (Arduino)
Let's consider a simple embedded system like an Arduino. It consists of a microcontroller and input sensors (for light, temperature, etc.).
- Programming: You write a program (called a sketch) on a PC, compile it, and upload it to the Arduino's flash memory via USB.
- The Boot Loader: A small piece of software called a boot loader is responsible for loading your sketch into a specific region of the Arduino's memory.
- Go to Figure 2.9: This figure illustrates the memory layout, showing where the boot loader and the single user sketch reside.
- Execution: Once loaded, the sketch begins running and continuously waits for the events it's programmed to handle (e.g., if a temperature sensor exceeds a threshold, it starts a fan motor).
- Why "Single-Tasking"? The Arduino can only run one sketch at a time. There is no operating system to manage multiple processes. If you load a new sketch, it completely erases and replaces the previous one in memory. The only "user interface" is the physical hardware sensors themselves. Process control here is minimal—just load and run a single program.
2. Multitasking System (FreeBSD/UNIX)
Now, let's contrast the Arduino with a modern, general-purpose operating system like FreeBSD (a UNIX derivative). These systems are inherently multitasking, meaning they can run many processes concurrently.
The Shell and User Commands: When you log in, a shell (command interpreter) starts. When you tell the shell to run a program (e.g., `gcc myprogram.c`), it doesn't replace the shell. Instead, it uses a sophisticated process control mechanism to create a new process.

The `fork()` and `exec()` Duo:

- Step 1: `fork()`: This system call is the first step. The shell uses `fork()` to create a new process that is an exact duplicate of itself. This new process is called the child process.
- Step 2: `exec()`: The newly created child process then immediately calls `exec()`. This system call replaces the child process's current memory space (which was a copy of the shell) with the code and data of the brand-new program you wanted to run (like `gcc`). Now the child process is effectively transformed into the `gcc` program.

Foreground vs. Background Execution:

- After forking, the shell (the parent process) has a choice:
  - Wait (Foreground): The shell calls `wait()` to pause its own execution until the child process (`gcc`) finishes. You see the command line "hang" until the program is done.
  - Don't Wait (Background): The shell does not call `wait()`. It immediately gives you a new prompt, allowing you to run other commands. The `gcc` process now runs "in the background." Since the shell is still active and using the keyboard, a background process cannot receive input directly from it; its I/O must be done through files or a GUI.

Process Termination and Exit Status:

- When the child process (`gcc`) finishes its work, it calls `exit()`. This system call terminates the process.
- Crucially, `exit()` can take an argument: the exit status. By convention, a status of `0` indicates success, and any non-zero value indicates an error (with different numbers often representing different types of errors).
- This exit status is passed back to the parent process (the shell). The shell can then check this status to automatically decide what to do next (e.g., if the compilation failed, it might not attempt to run the program).

The Big Picture:

- Go to Figure 2.10: This figure shows FreeBSD running multiple programs simultaneously in memory: a process for the shell, a process for a compiler, other user processes, etc. This is the essence of multitasking, enabled by the powerful process control system calls like `fork()`, `exec()`, `wait()`, and `exit()`.
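The fork/exec/wait pattern described above can be condensed into a small C function that runs a command in the foreground, the way a shell does. This is a simplified sketch, not a real shell: `run_command` is a made-up helper name, and error reporting is reduced to a single return value.

```c
#include <assert.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run a command the way a shell runs a foreground job:
   fork() a child, exec() the program in the child, and wait()
   in the parent. Returns the child's exit status, or -1 on error. */
int run_command(char *const argv[]) {
    pid_t pid = fork();            /* Step 1: duplicate this process      */
    if (pid < 0)
        return -1;                 /* fork failed                         */
    if (pid == 0) {                /* child: transform into the program   */
        execv(argv[0], argv);      /* Step 2: replace the memory image    */
        _exit(127);                /* only reached if exec() itself failed */
    }
    int status;                    /* parent: foreground = wait for child */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A background job would simply skip the `waitpid()` call and collect the exit status later.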
Examples of Windows and UNIX System Calls¶
This table from the book shows how the generic concepts map to real system calls in Windows and UNIX/Linux. Notice the different naming conventions.
| Category | Windows Example | UNIX Example |
|---|---|---|
| Process Control | `CreateProcess()`, `ExitProcess()`, `WaitForSingleObject()` | `fork()`, `exit()`, `wait()` |
| File Management | `CreateFile()`, `ReadFile()`, `WriteFile()`, `CloseHandle()` | `open()`, `read()`, `write()`, `close()` |
| Device Management | `SetConsoleMode()`, `ReadConsole()`, `WriteConsole()` | `ioctl()`, `read()`, `write()` |
| Information Maintenance | `GetCurrentProcessID()`, `SetTimer()`, `Sleep()` | `getpid()`, `alarm()`, `sleep()` |
| Communications | `CreatePipe()`, `CreateFileMapping()`, `MapViewOfFile()` | `pipe()`, `shm_open()`, `mmap()` |
| Protection | `SetFileSecurity()`, `InitializeSecurityDescriptor()` | `chmod()`, `umask()`, `chown()` |
The Standard C Library: A System-Call Wrapper¶
You rarely make a system call directly in a high-level language. Instead, you use functions provided by a standard library. The book uses the Standard C Library as a prime example.
- How it works: When your C program calls `printf("Greetings")`, you are not making a system call directly.
  - The `printf()` function in the C library is called.
  - This library function contains the necessary code to make the actual `write()` system call to the operating system kernel.
  - The kernel performs the write operation.
  - The return value from the `write()` system call is passed back to the `printf()` function, which may then return it to your program.
This layer of abstraction makes programming much easier. You don't need to know the gritty details of how to talk to the kernel; you just use a well-documented, portable function like printf(). The library handles the system-specific work for you.
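The layering can be demonstrated by skipping the library and issuing the underlying `write()` call ourselves on file descriptor 1 (standard output). The helper name `greet` is invented for this sketch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* printf() ultimately calls write() on file descriptor 1 (stdout).
   Issuing the system call directly shows the layer underneath.
   Returns the number of bytes written, or -1 on error. */
ssize_t greet(void) {
    const char *msg = "Greetings\n";
    return write(STDOUT_FILENO, msg, strlen(msg));
}
```

Note what the library buys you: `printf()` adds formatting and buffering on top of this raw byte-oriented call.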
2.3.3.2 File Management¶
This section introduces the fundamental system calls for working with files. A more detailed discussion of file systems is covered later in the book (Chapters 13 through 15).
Think of the Operating System (OS) as a librarian and files as books. The system calls are the standardized requests you make to the librarian to manage those books.
Here are the essential system calls for file operations:
- `create()` and `delete()`: These are your most basic operations.
  - `create()`: You provide a file name (e.g., "my_essay.txt") and, optionally, some attributes (like access permissions), and the OS creates a new, empty file for you.
  - `delete()`: You specify the name of the file, and the OS removes it from the file system, freeing up its space.
- `open()` and `close()`: Before you can actually read from or write to a file, you must `open()` it. This tells the OS, "I'm about to start working with this file." The OS then does some internal preparation, like checking your permissions. When you're done, you `close()` the file. This tells the OS you're finished, allowing it to clean up and ensure all data is properly saved.
- `read()`, `write()`, and `reposition()`: Once a file is open, you use these calls to interact with its content.
  - `read()`: Retrieves data from the file.
  - `write()`: Puts new data into the file or modifies existing data.
  - `reposition()`: Moves the "current position" pointer within the file. For example, you can rewind to the start of the file or skip to the end. This is also often called "seeking."
Directories:
If files are books, directories are the shelves and sections that organize them. The same set of operations—create, delete, open, close, etc.—are typically used for directories as well.
File Attributes (Metadata): Every file has associated information, known as attributes or metadata. This isn't the file's content, but information about the file. Examples include:
- File Name
- File Type (e.g., .txt, .pdf)
- Protection Codes (who can read, write, or execute it)
- Accounting Information (size, creation time, last modified time)
To manage this, at least two system calls are needed:
- `get_file_attributes()`: To check this information.
- `set_file_attributes()`: To change it (e.g., to change file permissions).
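On POSIX systems, these two generic calls correspond roughly to `stat()` (get) and `chmod()` (set). The helper below, whose name `make_read_only` is invented for this sketch, changes a file's permission bits and then reads back the metadata to confirm.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/* chmod() plays the role of set_file_attributes() and stat() the
   role of get_file_attributes(). Returns the file's permission
   bits after making it read-only, or -1 on error. */
long make_read_only(const char *path) {
    if (chmod(path, 0444) != 0)        /* set: owner/group/other read-only */
        return -1;
    struct stat st;
    if (stat(path, &st) != 0)          /* get: size, times, mode, ...      */
        return -1;
    return (long)(st.st_mode & 0777);  /* keep just the permission bits    */
}
```

`struct stat` also carries the other attributes the text mentions: size (`st_size`), timestamps (`st_mtime`), and ownership (`st_uid`).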
Additional Functionality:
Some operating systems provide direct system calls for high-level operations like move() and copy(). Others might only provide the basic building blocks (like read and write) and let you build a copy command using those. Finally, an OS might provide a ready-to-use program (like cp in Linux) that anyone can run. If this program can be called by other programs, it effectively acts as an Application Programming Interface (API).
2.3.3.3 Device Management¶
A process needs resources to run, such as CPU time, main memory, and disk drives. The OS acts as a resource manager. If a resource is free, the OS grants it to the process. If not, the process must wait.
We can think of all these resources as devices. Some are physical, like a printer or a disk drive. Others are abstract or virtual, like a file. Managing these devices involves specific system calls.
- `request()` and `release()`: In a multi-user system, you can't have everyone printing to the same printer at the same time. To manage this, a process must first `request()` a device. If it's available, the OS grants exclusive access to that process. When the process is finished, it must `release()` the device so others can use it. This is very similar to the `open()` and `close()` system calls for files.
  - Important Note: Some OSes allow unmanaged access, but this can lead to problems like device contention (two processes trying to use the same device simultaneously, causing chaos) and deadlock (a situation where two or more processes are waiting for each other to release a resource, causing everything to grind to a halt). See Chapter 8 for more on deadlock.
- `read()`, `write()`, and `reposition()`: Once a device is requested and allocated, you interact with it using the same familiar calls. You can `read()` from a scanner, `write()` to a disk drive, or `reposition()` the read/write head of a hard drive.
Unified File-Device Structure:
The similarity between file and device operations is so strong that many operating systems, like UNIX and Linux, merge them. In this model, everything is treated as a file. Your keyboard, your printer, and your hard disk all appear as special files in the file system. This allows programmers to use a single, consistent set of system calls (like read and write) to communicate with many different devices. These devices might be identified by special file names (e.g., /dev/usb), their location in the directory tree, or specific file attributes.
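The "everything is a file" idea is easy to see in code: on UNIX-like systems the same `open()`/`read()`/`close()` calls used for ordinary files also drive devices. This sketch reads from the `/dev/zero` pseudo-device (which supplies endless zero bytes); the helper name is invented, and `/dev/zero` is assumed to exist, as it does on Linux and macOS.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Devices appear as special files, so the file-oriented system
   calls work on them unchanged. Returns the number of bytes read
   from the /dev/zero device, or -1 on error. */
ssize_t read_zeros(char *buf, size_t len) {
    int fd = open("/dev/zero", O_RDONLY);   /* a device, opened like a file */
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, len);         /* same call as for a disk file */
    close(fd);
    return n;
}
```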
User Interface Design: Even if the underlying system calls for files and devices are different, the user interface (like the command line) can be designed to make them appear similar. This is a key design decision in OS development, aiming to create a more consistent and user-friendly experience.
2.3.3.4 Information Maintenance¶
This category covers system calls that are used to exchange information between a user program and the operating system itself. These calls don't primarily create files or manage devices; instead, they are used to query the state of the system, to get information about a process, or to aid in development and debugging.
Getting System Information: Many system calls exist simply to provide the program with details about the OS or the computer's environment. These are like asking the OS for a status report. Common examples include:
- `get_time()` and `get_date()`: Return the current time and date from the system clock.
- `get_system_info()`: Returns information like the operating system's version number, the amount of free physical memory, the amount of free disk space, or the number of currently running processes.
Debugging and Profiling: A crucial set of system calls helps programmers find and fix errors in their code (debugging) and analyze its performance (profiling).
- `dump()`: This system call tells the OS to take a snapshot of the program's memory (or a part of it) and write it to a file or the screen. By examining this memory dump, a programmer can see the exact state of the program at the moment it crashed or encountered an error. This is an invaluable debugging tool.
- Tracing System Calls: Tools like `strace` on Linux use low-level OS facilities to intercept and log every system call a program makes. This allows you to see the exact conversation between your program and the OS, which is extremely useful for understanding complex program behavior or finding where a program is failing.
- Single-Step Mode (Hardware Support): This is a feature provided by the CPU itself, which the OS and debuggers leverage. When enabled, the CPU executes one machine instruction and then generates a trap (a special type of interrupt). This trap is caught by a debugger program, which can then show the programmer what happened after each single instruction. This allows for very precise, instruction-by-instruction execution.
Time Profiling: This technique helps you understand where your program is spending most of its CPU time. It works by periodically sampling the value of the program counter (PC).
- The OS sets up a timer to generate interrupts at regular, frequent intervals (e.g., every 1 millisecond).
- At each interrupt, the OS records the current program counter address.
- After the program finishes running, the OS (or a profiler tool) can analyze all the sampled addresses. If a particular function or code segment has a high number of samples, it means the program was executing there frequently, indicating a "hot spot" that might be worth optimizing for better performance.
Process Information: The operating system maintains a detailed record, often called a Process Control Block (PCB), for every running process. This record contains all the vital statistics about the process. System calls are used to access and modify this information.
- `get_process_attributes()`: Retrieves information about a process, such as its Process ID (PID), its current state (running, waiting, etc.), its priority, what files it has open, and its CPU usage.
- `set_process_attributes()`: Allows certain attributes of a process to be changed. For example, a system administrator might use this to change the priority of a process to give it more or less CPU time.
Refer to Section 3.1.3 for a detailed list of the specific information that an operating system typically keeps about each process.
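POSIX counterparts of these generic attribute calls include `getpid()` and `getpriority()` (reading process attributes) and `nice()` (changing one: the scheduling priority). The helper name `lower_priority` is invented for this sketch, which ignores the clamping and error cases a real program would check.

```c
#include <assert.h>
#include <sys/resource.h>
#include <unistd.h>

/* nice() plays the role of set_process_attributes() and
   getpriority() the role of get_process_attributes().
   Returns the process's nice value after lowering its priority. */
int lower_priority(int delta) {
    int rc = nice(delta);                 /* change a scheduling attribute  */
    (void)rc;                             /* sketch: ignore clamping/errors */
    return getpriority(PRIO_PROCESS, 0);  /* read the attribute back        */
}
```

Note that an unprivileged process may lower its own priority (positive `delta`) but typically may not raise it again.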
2.3.3.5 Communication¶
This section covers how processes talk to each other, a concept known as Interprocess Communication (IPC). There are two primary models for this: Message-Passing and Shared Memory.
The Message-Passing Model¶
In this model, processes communicate by explicitly sending and receiving discrete packets of data, called messages. Think of it like sending letters or emails.
Connection Setup: Before any communication can happen, a connection must be established. This is similar to knowing someone's address before you can mail them a letter. To do this, a process needs to identify the other process it wants to talk to. This involves:
- `get_hostid()`: If the other process is on a different computer on a network, this call translates a well-known hostname (e.g., "server.com") into a network identifier like an IP address.
- `get_processid()`: This call translates a process name into a numeric Process ID (PID) that the operating system uses to uniquely identify the process.
Establishing a Link: Once the identifiers are known, a connection is opened using a call like `open_connection()` or by using the general file system's `open()` call on a special communication channel. The receiving process must give its permission to communicate with an `accept_connection()` call.

Client-Server Architecture: This is a very common pattern. A server (a special-purpose, always-running program called a daemon) executes a `wait_for_connection()` call, sleeping until someone wants to talk to it. A client process then initiates the connection. Once the connection is established, they exchange data using `read_message()` and `write_message()` system calls.

Closing the Link: Communication is terminated with a `close_connection()` call.

Mailboxes: Messages can also be passed indirectly through a common mailbox (or message queue), where processes leave and pick up messages without needing a direct, active connection at the same time.
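The simplest UNIX realization of message passing between related processes is a `pipe()`: an OS-managed channel where the child "sends" with `write()` and the parent "receives" with `read()`, the kernel handling buffering and synchronization. The helper name `ping` is invented for this sketch.

```c
#include <assert.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Minimal message passing over a pipe. The child writes one
   message; the parent reads it. Returns the number of bytes
   received, or -1 on failure. */
ssize_t ping(char *buf, size_t len) {
    int fds[2];                   /* fds[0] = read end, fds[1] = write end */
    if (pipe(fds) != 0)
        return -1;
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {               /* child: send one message and exit */
        write(fds[1], "ping", 4);
        _exit(0);
    }
    close(fds[1]);                /* parent: close the unused write end */
    ssize_t n = read(fds[0], buf, len);
    close(fds[0]);
    waitpid(pid, NULL, 0);        /* reap the child */
    return n;
}
```

Because the kernel moves every byte through the pipe, the two processes never touch each other's memory: exactly the safety/overhead trade-off the message-passing model makes.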
The Shared-Memory Model¶
This model takes a completely different approach. Instead of sending data back and forth, processes agree to share a section of their memory. Think of this like a shared whiteboard that multiple people can read from and write to.
Creation and Attachment: Normally, the OS strictly prevents one process from accessing another's memory for safety. To bypass this, processes must explicitly create and attach to a shared memory region.
- `shared_memory_create()`: This system call creates a block of memory that can be accessed by multiple processes.
- `shared_memory_attach()`: This call grants a process access to an already created shared memory segment.
Direct Access and Responsibility: Once attached, processes can read and write data directly to this shared area simply by using pointers and memory addresses, just like accessing their own memory. This is extremely fast. However, the OS is no longer managing the data transfer. The processes themselves are entirely responsible for:
- The Format of the Data: They must agree on what the data in the memory means.
- Synchronization: They must ensure they are not reading and writing to the same location simultaneously, which would lead to corrupted data. (Mechanisms for this, like semaphores, are discussed in Chapter 6).
Comparison of the Two Models:
- Message Passing is better for exchanging smaller amounts of data and is simpler because the OS handles the synchronization. It is also the only practical choice for communication between computers over a network.
- Shared Memory is the fastest form of IPC because it works at memory speed. It is ideal for high-performance computing on a single machine where large amounts of data need to be exchanged. The main disadvantages are the complexity of managing synchronization and the potential for errors if not done correctly.
(Refer to Chapter 4 for a discussion on threads, which are a variation of processes that share memory by default.)
2.3.3.6 Protection¶
Protection is the mechanism that controls what resources a user or a process is allowed to access. Historically, this was only a major concern for large, multi-user systems. Today, with networking so pervasive, every computer—from servers to your smartphone—must have robust protection.
Protection is about enforcing the rules that define who can do what.
Key System Calls for Protection:
- `set_permission()` and `get_permission()`: These calls are used to manipulate the access rights (e.g., read, write, execute) for resources like files and disks.
- `allow_user()` and `deny_user()`: These calls are used to explicitly grant or revoke a specific user's access to a particular resource.
Protection vs. Security: It is important to distinguish between these two related concepts:
- Protection: The internal mechanism (the rules and the enforcement) that controls how programs and users access system resources. (Covered in Chapter 17).
- Security: The larger, external defense of a system against external threats (like hackers and malware). Security uses protection mechanisms as one of its primary tools. (Covered in Chapter 16).
2.4 System Services (System Utilities, System Programs)¶
Introduction: The Ecosystem on Top of the OS¶
Recall the logical computer hierarchy from Figure 1.1: Hardware -> Operating System -> System Services -> Application Programs. System services, also known as system utilities, are programs that provide a convenient environment for developing and running other programs. They are not part of the operating system kernel but are bundled with it. Some are simple wrappers around system calls, while others are very complex.
These services can be categorized as follows:
Categories of System Services¶
1. File Management These utilities help users and programs manipulate files and directories. Common operations include:
- Creating, deleting, copying, and renaming files and directories.
- Printing, listing, and displaying file contents.
- Managing file permissions and attributes.
2. Status Information These programs query the system for information. They range from simple to complex:
- Simple: Getting the date, time, available memory, disk space, or number of users.
- Complex: Providing detailed performance data, system logs, and debugging information. The output can be sent to a terminal, a file, or a GUI window. Some systems, like Windows, use a registry—a central database for storing and retrieving system and application configuration information.
3. File Modification These are utilities for creating and editing the content of text files.
- Text Editors: Programs like vim, nano, or GUI editors allow users to create and modify files.
- Text Processing Commands: Special utilities to search file contents (e.g., grep) or transform text (e.g., sed, awk).
4. Programming-Language Support To support software development, operating systems often come with or provide easy access to:
- Compilers (e.g., for C, C++), assemblers, debuggers, and interpreters (e.g., for Python, Java). These may be included or available as a separate download.
5. Program Loading and Execution Once a program is compiled, it must be loaded into memory. The system provides tools for this:
- Loaders: Programs like absolute loaders, relocatable loaders, and linkage editors handle the complex process of preparing an executable file and loading it into memory for execution.
- Debuggers: Systems for debugging programs, both at the high-level language and machine code levels.
6. Communications These programs establish connections between processes, users, and different computer systems. They enable:
- Sending instant messages between users.
- Browsing web pages.
- Sending and receiving email.
- Remote login (e.g., using ssh).
- Transferring files between machines (e.g., using ftp or scp).
7. Background Services (Daemons) General-purpose systems launch many system processes automatically at boot time. Constantly running system processes are known as services, subsystems, or daemons.
- Examples:
- The network daemon that listens for incoming connections.
- Process schedulers that start jobs at specific times.
- System error monitoring services.
- Print servers that manage print jobs.

A typical system has dozens of such daemons running in the background. They are essential for performing system-level tasks outside the kernel.
Application Programs and the User's View¶
Along with system programs, operating systems are typically distributed with useful application programs for common tasks, such as:
- Web browsers, word processors, spreadsheets.
- Database systems, compilers, games.
- Plotting and statistical-analysis packages.
The Crucial Point: For most users, the "operating system" is defined by these application and system programs, not by the underlying system calls. The user interface shapes the entire experience.
- Example 1: A user on a Mac sees the Aqua GUI and the UNIX shell in a terminal window. Both interfaces use the same set of macOS system calls, but they present them in completely different ways.
- Example 2: A user can dual-boot the same PC into either macOS or Windows. This means the same user, on the same hardware, interacts with two entirely different sets of applications and interfaces, all built upon different implementations of system services and system calls that manage the same physical resources.
2.5 Linkers and Loaders¶
This section explains the journey a program takes from being source code on a disk to a running process in memory. The steps involved are compiling, linking, and loading, which are visually summarized in Figure 2.11.
The Steps from Source Code to Execution¶
1. Compilation (gcc -c main.c):
- Your source code (e.g., main.c) is processed by a compiler.
- The output is an object file (e.g., main.o).
- This object file is in a relocatable object file format. This means the machine code inside is not yet tied to a specific memory address; it's designed to be loaded into any available location in physical memory.

2. Linking (gcc -o main main.o -lm):
- The linker takes one or more relocatable object files (like main.o) and combines them into a single binary executable file (e.g., main).
- The linker also incorporates code from libraries (like the math library, specified by -lm). These libraries are collections of pre-compiled object files for common tasks.
- A key job of the linker is relocation. It assigns final memory addresses to the various parts of the program (code, data) and then adjusts all the references within the code to match these final addresses. For example, it ensures a call to printf() actually goes to the correct location of the printf function in the standard library.

3. Loading (./main):
- The loader is the part of the operating system that places the executable file into memory so it can run.
- When you type a command like ./main in a shell, the shell first uses the fork() system call to create a new process. It then uses the exec() system call to invoke the loader.
- The loader loads the program's code and data into the memory of the newly created process. Once loaded, the program is eligible to be scheduled to run on a CPU core.
Static vs. Dynamic Linking¶
The process described above implies that all libraries are fully copied ("statically linked") into the executable file. However, most systems use a more efficient method called dynamic linking.
Statically Linked Libraries: The library code is physically copied into the executable file at link time. This creates a large, self-contained executable that doesn't depend on external files, but it wastes disk and memory space if multiple programs use the same library.
Dynamically Linked Libraries (DLLs in Windows, Shared Objects in Linux): The library code is not copied into the executable. Instead, the linker only inserts some relocation information (a "stub") that tells the loader, "This program needs this library."
- When the program is loaded, the loader sees this information and dynamically links the required library, loading it into memory.
- Major Benefit: Multiple running processes can all share a single copy of a dynamically linked library in memory, leading to significant memory savings. (This is covered in more detail in Chapter 9).
Executable File Formats¶
Object and executable files are not just raw machine code; they have a standard structure that includes the code itself and a symbol table (metadata about functions and variable names used in the program).
ELF (Executable and Linkable Format): This is the standard format used by UNIX, Linux, and many other systems. There are different ELF formats for relocatable object files (*.o) and final executable files.
- The file command can be used to identify a file's type (e.g., file main.o reports it as an ELF relocatable file).
- The readelf command can inspect the detailed sections of an ELF file.
- A crucial piece of information in an executable ELF file is the program entry point—the memory address of the very first instruction to be executed when the program starts.
Other Formats: Windows uses the PE (Portable Executable) format, and macOS uses the Mach-O format. They serve the same fundamental purpose as ELF.
2.6 Why Applications Are Operating-System Specific¶
This section explains a fundamental reality of computing: a program compiled for Windows won't run on macOS, and an Android app won't run on an iPhone. While it would be ideal if all programs were universally compatible, several technical barriers prevent this.
The Core Problem: A House Divided¶
An application and an operating system are built to work together like a custom key and a specific lock. The main reasons for this incompatibility are:
- Different System Calls: As discussed earlier, each OS provides a unique set of system calls. An application written for one OS is full of requests for that OS's specific "services." Another OS simply won't understand these requests.
- Different Executable File Formats: Each OS has a specific binary format for executable files (like ELF for Linux, PE for Windows, Mach-O for macOS). The OS's loader expects the file to be structured in a precise way with a specific header and layout. A format it doesn't recognize cannot be loaded. Go to Figure 2.11 to see the loader's role in this process.
- Different CPU Instruction Sets: Even on the same physical hardware, different operating systems might manage the CPU differently. More fundamentally, applications are compiled into machine code for a specific CPU architecture (like Intel x86 or ARM). An ARM CPU cannot execute instructions meant for an x86 CPU, and vice-versa.
How Cross-Platform Applications Are Made Possible¶
Despite these barriers, you do see applications like Firefox or Python scripts running on multiple systems. This is achieved through one of three primary methods:
Interpreted Languages (e.g., Python, Ruby):
- The application is distributed as human-readable source code.
- An interpreter, which is a native program written for each specific OS, reads the source code line-by-line and executes the equivalent native OS instructions on the fly.
- Drawbacks: Performance is slower than native code because of the real-time translation. The application is also limited to the features provided by the interpreter, which may not expose all the advanced features of the underlying OS.
Virtual Machines and Runtime Environments (e.g., Java):
- The application is compiled into an intermediate, platform-neutral code called bytecode (e.g., Java .class files).
- A Runtime Environment (RTE), like the Java Virtual Machine (JVM), is a native program that is ported to many different operating systems. This RTE loads the bytecode and translates it into native machine instructions for the host OS and CPU as the program runs.
- Drawbacks: Similar to interpreters, this "just-in-time" compilation can impact performance, and the application is confined to the "sandbox" of the virtual machine.
Porting the Application (e.g., using POSIX API):
- The developer writes the application in a standard language (like C) and uses a standardized API (like POSIX for UNIX-like systems).
- To run on a different OS, the application's source code must be ported. This means it is recompiled from source using a compiler and libraries specifically for that target operating system.
- Drawback: This is time-consuming and expensive, as it requires a separate build, testing, and debugging cycle for each operating system version.
The Deeper Challenges: APIs and ABIs¶
Even with the methods above, creating a true "write once, run anywhere" application is very difficult due to deeper system-level differences.
Application Programming Interface (API): This is a high-level set of functions for building software. An application designed to use the iOS API for its user interface will not work on Android, which provides a completely different set of GUI APIs.
Application Binary Interface (ABI): This is the low-level equivalent of an API. An ABI is a strict contract that defines exactly how binary code should interact with a specific operating system on a specific CPU architecture. It dictates:
- How parameters are passed to system calls.
- The organization of the program's stack in memory.
- The binary format of system libraries.
- The sizes of fundamental data types (e.g., int, long).

An ABI is defined for a specific combination, like "Linux on ARMv8." A binary compiled for one ABI will not work on a system with a different ABI.
In summary, the combination of unique system calls, different executable file formats, varying CPU architectures, and the strict requirements of APIs and ABIs means that an application is inherently tied to its target platform. The cross-platform applications we use represent a massive engineering effort to port, interpret, or virtualize the code for each specific operating system and hardware combination.
2.7 Operating-System Design and Implementation¶
Designing and building an operating system (OS) is a complex challenge with no single "correct" answer. However, there are successful strategies and principles that guide developers. This section covers the initial design goals and a fundamental principle for creating flexible systems.
2.7.1 Design Goals¶
The very first step in creating an OS is to define what it should do. The high-level goals are shaped by two main factors:
- The Hardware: What type of computer will it run on?
- The System Type: Is it for a traditional desktop/laptop, a mobile device, a distributed network, or a real-time control system?
Once these are known, requirements can be split into two groups: what users want and what system developers need.
User Goals¶
These are the qualities an end-user (like you) cares about. A system should be:
- Convenient, Easy to Learn, and Easy to Use
- Reliable (it doesn't crash)
- Safe (secure from threats)
- Fast
While these goals are obvious to the user, they are too vague for a developer. There's no universal agreement on how to specifically achieve "easy to use" or "fast" in code.
System Goals¶
These are the qualities important for the people who design, build, and maintain the OS. The system should be:
- Easy to Design, Implement, and Maintain
- Flexible (able to adapt to new requirements)
- Reliable and Error-Free
- Efficient (in its use of hardware resources)
Like user goals, these are also somewhat vague and open to interpretation.
The Key Takeaway: There is no single set of requirements for all operating systems. The "right" design depends entirely on the environment. For example:
- VxWorks (a real-time OS for embedded systems like Mars rovers) has vastly different requirements than Windows Server (an OS for managing large enterprise networks).
Designing an OS is a creative task. While there's no recipe, software engineering provides useful principles to guide the process.
2.7.2 Mechanisms and Policies¶
A crucial principle for building a flexible and maintainable OS is the separation of policy from mechanism.
- Mechanism: This refers to how something is done. It's the underlying algorithm or code that provides a capability.
- Policy: This refers to what will be done. It's the decision-making rule that uses the mechanism.
A Simple Analogy: Think of a traffic light.
- The mechanism is the light itself—its wiring, timers, and bulbs that make it change colors.
- The policy is the decision of how long the light stays green for the main road versus the side road.
Why is this Separation Important?¶
The main benefit is flexibility. Policies change often, but mechanisms do not. If they are tightly bound together, every time you want to change a policy, you have to rewrite the mechanism.
By keeping them separate, you can support different policies by simply changing a parameter for the general mechanism.
OS Example 1: CPU Protection
- The mechanism is the timer (as discussed in Section 1.4.3), which is a piece of hardware that interrupts the CPU after a set time.
- The policy is deciding how long that timer is set for each user. Should each user get 100 milliseconds or 200 milliseconds? This is a policy decision that can be changed without altering the timer mechanism itself.
OS Example 2: Process Priority
- The mechanism is a system that can assign and enforce different priority levels for programs.
- The policy is the rule for assigning those priorities. Should I/O-intensive programs (like a video player) have higher priority than CPU-intensive programs (like a scientific calculation)? The same mechanism can be used to enforce either policy just by changing the rule.
This Principle in Practice¶
Microkernel OSs (like Mach): Take this idea to the extreme. They provide only the most basic, policy-free mechanisms (the bare-minimum building blocks). All higher-level policies (like how to schedule the CPU) are built on top of this, often by user-level programs, making the system very flexible.
Windows and macOS: Use a different approach. They build both the mechanism and policy directly into the kernel. This enforces a consistent "look and feel" and behavior across all devices running that OS. The trade-off is less flexibility for the sake of a uniform user experience.
Open Source vs. Commercial (Linux vs. Windows):
- The standard Linux kernel comes with a specific CPU scheduler (a mechanism with a default policy). However, because the code is open, anyone can modify or replace this scheduler to implement a different policy.
- In contrast, with Windows, the scheduling mechanism and policy are fixed within the kernel and cannot be easily changed by the user.
Final Note: Policy decisions are needed whenever a resource (CPU time, memory, etc.) must be allocated. The question "What should be done?" is a policy question. The question "How do we do it?" is a mechanism question. Keeping them separate is a hallmark of good OS design.
2.7.3 Implementation¶
After the design of an operating system is planned, the next step is to actually build it—this is the implementation phase. Operating systems are massive collections of programs, often developed by hundreds of people over many years, so it's hard to generalize about how they are built. However, we can look at the common tools and languages used.
The Shift from Assembly to High-Level Languages¶
- Early Days: The first operating systems were written entirely in assembly language. This is a very low-level programming language that has a direct, one-to-one relationship with the machine's hardware instructions.
- Modern Practice: Today, almost all operating systems are written primarily in high-level languages like C or C++. Only a few small, critical parts of the system are still written in assembly language.
Modern OSs often use a mix of languages for different layers:
- Lowest Level (The Kernel Core): Written in C and a little bit of Assembly.
- Higher-Level Routines: Written in C and C++.
- System Libraries & Frameworks: Can be written in C++ or even higher-level languages like Java.
A Concrete Example: Android The book uses Android as a great real-world example of this layered language approach (we'll cover its full architecture in more detail in Section 2.8.5.2). Its components are built with different languages:
- Kernel: Mostly C with some Assembly.
- System Libraries: C or C++.
- Application Frameworks (the interface for app developers): Mostly Java.
Advantages of Using High-Level Languages¶
Using languages like C/C++ instead of assembly offers the same benefits for OS development as it does for regular application programming:
- Faster Development: Code is written more quickly.
- Compactness: The source code is more concise and smaller.
- Understandability and Debugging: The code is easier for humans to read, understand, and fix.
- Compiler Benefits: When compiler technology improves, you can simply recompile the entire OS to get better, more optimized machine code without changing a single line of your original source code.
- Portability: This is a huge advantage. An OS written in a high-level language is much easier to adapt ("port") to run on different hardware (like moving from Intel x86 chips to ARM chips in phones and tablets). The alternative—rewriting the entire OS in a new assembly language for each hardware platform—would be a monumental task.
Addressing the Potential Disadvantages¶
You might wonder, "But isn't assembly language faster?" The potential downsides of using a high-level language are:
- Reduced speed of the final program.
- Increased memory/storage requirements.
However, the book explains that these are not major issues for modern systems. Here’s why:
Modern Compilers are Brilliant: While an expert human can write super-efficient small routines in assembly, a modern compiler can analyze entire large programs and apply complex optimizations that are beyond human capability. It can perfectly handle the intricate details of modern processors with features like deep pipelining and multiple execution units.
Algorithm over Code: The biggest performance gains in an OS come from using better data structures and algorithms, not from hand-optimizing every line in assembly. A smart algorithm in C will almost always beat a mediocre one written in perfect assembly.
Only a Small Part is Critical: Although operating systems are large, only a tiny fraction of the code is "performance-critical." The key routines where speed is absolutely essential are:
- Interrupt handlers
- I/O Manager
- Memory Manager
- CPU Scheduler
The standard practice is to first get the entire system working correctly using high-level languages. Once it's working, developers can use profiling tools to identify performance bottlenecks. Only these specific, critical sections are then optimized or carefully rewritten for maximum efficiency.
2.8 Operating-System Structure¶
A modern operating system is incredibly large and complex. To manage this complexity and make the system easier to build, maintain, and modify, it must be carefully organized. A standard engineering approach is to break the huge task down into smaller, manageable pieces called modules.
Each module should be a well-defined part of the system with clear responsibilities and cleanly specified interfaces for interacting with other modules.
A Programming Analogy: Think about how you structure a large program. You don't put all your code inside the main() function. Instead, you separate logic into multiple functions, each with a clear purpose, defined parameters, and return values. You then call these functions from main(). Structuring an OS uses the same fundamental principle but on a much larger scale.
We learned about the common components of an OS (like the process scheduler, memory manager, etc.) in Chapter 1. Now, we will look at how these components are connected and fused together to form the kernel.
2.8.1 Monolithic Structure¶
The simplest way to structure an operating system kernel is to use no formal structure at all. This approach is called the Monolithic Structure.
In a monolithic kernel, all the functionality of the kernel—the file system, CPU scheduler, memory manager, device drivers, etc.—is packed into a single, large, static binary file. This entire kernel runs in a single, powerful memory area known as kernel space (a single address space).
The UNIX Example¶
The original UNIX operating system is a classic example of this approach, though with some minor structuring. It was divided into two main parts:
- The kernel
- The system programs (everything outside the kernel)
The kernel itself was a monolithic entity, but as it evolved, it was separated into various interfaces and device drivers.
Go to Figure 2.12: Traditional UNIX system structure.
This figure shows a simplified, layered view of traditional UNIX. Notice that everything below the system-call interface and above the physical hardware is the kernel. It's a single block that contains a huge amount of functionality, from the file system and CPU scheduling down to the drivers that talk directly to the disks and memory.
The Linux Example¶
Linux, which is based on UNIX, also uses a monolithic kernel. Go to Figure 2.13: Linux system structure.
This figure shows how applications use the glibc (GNU C Library) to make system calls into the kernel. The Linux kernel itself runs entirely in kernel mode in a single address space, making it monolithic.
However, an important modern feature is that Linux has a modular design. This means you can add and remove kernel functionality (like device drivers) while the system is running, which we will discuss in Section 2.8.4. This modularity helps manage the complexity of a monolithic kernel.
Advantages and Disadvantages of a Monolithic Kernel¶
Disadvantage: Complexity and Inflexibility
- They are difficult to implement correctly and can be hard to extend or modify because all the parts are intertwined. A bug in one driver can crash the entire kernel.
Advantage: Performance
- This is the main reason monolithic kernels are still widely used. Because everything is in one address space:
  - System calls are fast (there's very little overhead when switching into the kernel to perform a task).
  - Communication within the kernel is extremely fast (functions can just call each other, as there are no barriers between components).
Conclusion: Despite the drawbacks in complexity, the raw speed and efficiency of monolithic kernels are so significant that they are the foundation of major operating systems like UNIX, Linux, and Windows.
2.8.2 Layered Approach¶
The monolithic kernel we just discussed is often called a tightly coupled system because all its parts are interconnected; a change in one part can have unexpected and wide-ranging effects on other parts.
To address this, we can design a loosely coupled system. In this design, the kernel is divided into separate, smaller components where each has a specific and limited job. The big advantage of this modular approach is that a change in one component only affects that component, giving developers much more freedom to modify and improve parts of the system in isolation.
There are several ways to build a modular system. One classic method is the Layered Approach.
What is the Layered Approach?¶
In this model, the operating system is broken down into a series of hierarchical layers (or levels). Go to Figure 2.14: A layered operating system.
- The lowest layer, Layer 0, is the hardware itself.
- The highest layer, Layer N, is the user interface.
- Every layer in between is a step in the hierarchy, building upon the one below it.
How a Layer Works¶
An operating-system layer is essentially an abstract object. This means it is made up of:
- Data: The information it manages.
- Operations: The set of functions that can manipulate that data.
Here's how the layers interact:
- A typical layer, let's call it Layer M, can be invoked by the higher-level layer above it, Layer M+1.
- In order to do its job, Layer M can, in turn, invoke operations on the lower-level layer beneath it, Layer M-1.
Advantages of the Layered Approach¶
The primary advantage is simplicity in construction and debugging. The system is built like a stack of building blocks, from the bottom up.
Easier Debugging: You debug the system one layer at a time, starting from the bottom.
- You debug Layer 0 (which interacts directly with the hardware) without worrying about any other layers because it relies on nothing else (assuming the hardware works).
- Once Layer 0 is confirmed correct, you debug Layer 1. Any error you find must be in Layer 1, because you've already proven Layer 0 is correct.
- This process continues upward. This makes finding the source of a bug much more straightforward.
Information Hiding (Abstraction): Each layer only needs to know what the layer below it does, not how it does it. Layer M uses the services of Layer M-1 as a "black box." This hides the complex details of data structures, operations, and hardware from the higher-level layers, simplifying the design of each layer.
Disadvantages and Why It's Rarely Used Pure¶
While this approach is very clean, relatively few modern operating systems use a pure layered approach. Two major reasons are:
Difficulty in Layer Definition: It is very challenging to define the functionality of each layer perfectly. For example, should the virtual memory manager (which needs to swap pages to disk) be above or below the disk driver layer? The virtual memory layer needs the disk driver, but putting it above would violate the rule that a layer can only use the layer immediately below it.
Performance Overhead: This is a critical flaw. In a strictly layered system, a simple user request (like writing to a file) might have to pass down through many layers (e.g., user program -> file system layer -> memory management layer -> disk scheduler layer -> disk driver layer), with each step adding a small amount of overhead. This can make the overall system slower compared to a monolithic kernel where components can communicate directly.
Conclusion: You will see successful layered designs in other areas, like the TCP/IP network protocol stack. In modern OS design, the pure layered approach is not common. However, the general idea of modularity is essential. Most contemporary systems use a hybrid approach with fewer, more powerful layers, gaining the benefits of modular, maintainable code while avoiding the strict performance and definition problems of a pure layered model.
2.8.3 Microkernels¶
As we learned, the original UNIX kernel was monolithic and became large and difficult to manage as it grew. To solve this problem, researchers developed a different structural approach called the Microkernel.
This method aims to simplify the kernel by removing all non-essential components from it. These removed components (like file systems, device drivers, etc.) are then implemented as standard user-level programs that run in their own separate address spaces. The result is a much smaller, minimal kernel—the "micro" kernel.
There's no absolute rule on what is "essential," but typically, a microkernel provides only the most fundamental services:
- Minimal Process Management (e.g., creating threads/tasks)
- Minimal Memory Management (e.g., basic address space management)
- Interprocess Communication (IPC) - This is the most critical service.
Go to Figure 2.15: Architecture of a typical microkernel.
This figure shows how the system is structured. The microkernel itself is small and sits at the core. Other services, like the file system and device drivers, are now separate "servers" or "programs" running in user space. The application program interacts with these services through the microkernel.
How it Works: Message Passing¶
The main job of the microkernel is to act as a secure messenger, providing communication between client programs (like your application) and the various services (like the file server) using a technique called message passing (which we discussed in Section 2.3.3.5).
Example: When your application wants to read a file, here is the process:
- Your application (the client) cannot talk directly to the file server.
- Instead, it sends a message to the microkernel saying, "I want to read from this file."
- The microkernel safely passes this message to the file server, which is running as a user process.
- The file server does the work, gets the data, and sends a reply message back to the microkernel.
- The microkernel then passes this reply message back to your application.
The client and service never interact directly; all communication is brokered by the microkernel.
Advantages of the Microkernel Approach¶
- Easier Extension: Adding a new service (e.g., a new file system) does not require modifying the kernel. You simply add a new user-level program. This makes the system more modular and flexible.
- Portability: Because the kernel is so small, it's easier to adapt the entire operating system to new hardware platforms.
- Security and Reliability: This is a major benefit. Since most services run as user processes, they are isolated from the kernel and from each other. If a file server crashes, it can be restarted without bringing down the entire machine. A bug in a device driver won't crash the kernel.
Real-World Examples:
- Darwin: This is the kernel at the heart of macOS and iOS. Darwin actually contains the Mach microkernel as one of its core components. We will explore this more in Section 2.8.5.1.
- QNX: A very reliable real-time OS used in embedded systems (like cars and medical devices). Its Neutrino microkernel is tiny and only handles core tasks; everything else is a user process.
The Major Disadvantage: Performance¶
The biggest problem with microkernels is performance overhead.
- Message Copying: When two services need to communicate, messages must be copied between their separate address spaces. This is slower than the simple function calls used in a monolithic kernel.
- Context Switching: The OS must constantly switch between processes—from the application, to the kernel, to the server, and back. Each of these "context switches" takes a significant amount of time.
This performance hit has been the main reason microkernels are not more widespread.
A Case Study: Windows NT
- The first version of Windows NT was designed with a microkernel structure.
- Its performance was poor compared to the monolithic Windows 95.
- To fix this, Microsoft moved key services from user space back into kernel space in Windows NT 4.0, making the architecture more monolithic to gain speed.
- This trend continued, and modern Windows versions are now more monolithic than microkernel.
We will see in Section 2.8.5.1 how macOS uses clever techniques to mitigate the performance penalties of its Mach microkernel.
2.8.4 Modules¶
The most prevalent design methodology in modern operating systems is the use of Loadable Kernel Modules (LKMs). This approach tries to get the best of both worlds: the performance of a monolithic kernel and the flexibility of a microkernel.
In this design, the kernel is structured as a set of core components and can dynamically link in additional services via modules. These modules can be loaded when the system boots or, crucially, while the system is running.
The main idea is to keep the fundamental services in the core kernel, while implementing other, more specific services as dynamically loadable pieces. This is far superior to the old method of building every feature directly into the kernel, which would require a lengthy kernel recompilation for every new device or file system you wanted to support.
Example: The kernel might have its core CPU scheduler and memory manager built-in. Support for a specific file system (like NTFS) or a new device (like a unique graphics card) would then be added as a loadable module.
Relationship to Other Structures¶
The modular approach is a hybrid that borrows ideas from earlier models:
- Compared to Layered Systems: It is similar because each module has well-defined and protected interfaces. However, it is more flexible because a module is not restricted to only calling the module below it; any module can call any other module.
- Compared to Microkernels: It is similar because the kernel keeps its core small and loads other services. However, it is much more efficient because modules are loaded directly into kernel space. They can communicate with each other using fast function calls, avoiding the slow message passing and context switching between user-space processes that plagues pure microkernels.
The Linux Example¶
Linux uses loadable kernel modules extensively, especially for:
- Device Drivers
- File System Support
This is why, when you plug a new USB device into a running Linux machine, the system can automatically find and load the necessary driver module without needing to reboot. Similarly, modules can be unloaded from the kernel when they are no longer needed.
The Result for Linux: LKMs give Linux the dynamic and modular qualities that make it easy to extend, while preserving the raw performance benefits of a monolithic kernel where everything runs in the same address space.
2.8.5 Hybrid Systems¶
In the real world, almost no operating system uses a single, pure structural model. Instead, they combine the best ideas from different models to create Hybrid Systems that balance performance, security, and usability.
Let's look at how major OSs are hybrids:
- Linux: Is fundamentally monolithic (for performance) but uses loadable kernel modules (for flexibility).
- Windows: Is also largely monolithic (for performance), but it incorporates microkernel ideas. For example, it runs some subsystems (called "personalities") as separate user-mode processes for isolation, and it also supports dynamically loadable kernel modules.
We will now explore the architectures of three specific hybrid systems: macOS, iOS, and Android.
2.8.5.1 macOS and iOS¶
This section details the architecture of Apple's operating systems: macOS for desktops/laptops and iOS for mobile devices (iPhone/iPad). While designed for different hardware, they share a significant amount of underlying architecture. The general structure is layered, as shown in the diagram.
Go to Figure 2.16: Architecture of Apple’s macOS and iOS operating systems.
The architecture consists of the following layers, from top to bottom:
Layered Architecture Overview¶
User Experience Layer:
- This is the software interface that users directly see and interact with.
- macOS uses the Aqua interface, designed for use with a mouse or trackpad.
- iOS uses the Springboard interface, designed for touch screens.
Application Frameworks Layer:
- This layer provides the key Application Programming Interfaces (APIs) for developers.
- It includes the Cocoa framework (for macOS) and Cocoa Touch framework (for iOS).
- These frameworks are used with Objective-C and Swift programming languages.
- The main difference is that Cocoa Touch provides support for mobile-specific hardware features like touch screens.
Core Frameworks Layer:
- This layer contains frameworks that handle graphics and media.
- Examples include QuickTime and OpenGL.
Kernel Environment:
- This is the foundation of the entire operating system, known as Darwin.
- Darwin is a hybrid kernel that includes the Mach microkernel and parts of the BSD UNIX kernel. We will explore Darwin in more detail shortly.
The diagram also shows that applications have flexibility. They can:
- Use the high-level user experience features.
- Bypass the UI and interact directly with the Application or Core Frameworks.
- Bypass all frameworks entirely and communicate directly with the Darwin kernel environment. An example is a simple C program that makes standard POSIX system calls.
Key Differences Between macOS and iOS¶
Despite their shared architecture, there are important distinctions:
Hardware and Architecture:
- macOS is compiled to run on Intel x86 architectures (and now Apple Silicon).
- iOS is compiled for ARM-based architectures used in mobile devices.
Kernel and System Optimizations:
- The iOS kernel has been modified for mobile needs, with a stronger focus on power management and aggressive memory management.
- iOS has more stringent security settings than macOS.
Developer Access:
- iOS is much more restricted. It is a more closed system where developer access to low-level APIs (like POSIX and BSD system calls) is restricted.
- On macOS, these low-level APIs are openly available to developers.
The Darwin Kernel Environment: A Hybrid Structure¶
Go to Figure 2.17: The structure of Darwin.
Darwin is the core kernel environment and is a great example of a hybrid system. It combines the Mach microkernel and the BSD UNIX kernel into a layered, single address space. Its structure includes:
Dual System-Call Interfaces:
- Unlike most OSs, which have one system-call interface, Darwin provides two:
  - Mach System Calls (also known as Mach traps).
  - BSD System Calls (which provide standard POSIX functionality).
- Applications access these calls through a rich set of libraries (C library, networking libraries, etc.).
Mach's Role (The Microkernel Core):
- Mach provides the fundamental OS services:
  - Memory Management
  - CPU Scheduling
  - Interprocess Communication (IPC) via message passing and Remote Procedure Calls (RPCs).
- Mach uses kernel abstractions like tasks (a process), threads, and ports (for IPC).
- Example: When an application uses the BSD `fork()` system call to create a process, Mach internally represents it using its `task` abstraction.
The I/O Kit and Kernel Extensions:
- Darwin provides an I/O Kit for developing device drivers.
- It supports dynamically loadable modules, which macOS calls kernel extensions (kexts).
Solving the Microkernel Performance Problem¶
This is a critical detail. As discussed in Section 2.8.3, a pure microkernel suffers performance overhead because services run in user space, requiring slow message passing and context switches.
Darwin solves this by not being a pure microkernel. Instead, it combines Mach, BSD, the I/O Kit, and all kernel extensions (kexts) into a single address space. This means:
- Mach's services (scheduling, memory management) and BSD's services (file systems, networking) all run in kernel space, not as separate user processes.
- Message passing within Mach still occurs, but because everything is in the same address space, no data copying is required, making it extremely fast.
- The result is a system that has the modular design benefits of a microkernel but the high performance of a monolithic kernel.
Open Source Status¶
- The Darwin operating system is open source. This has allowed third-party projects to add features like the X11 windowing system or new file systems.
- However, the higher-level layers, including the Cocoa/Cocoa Touch application frameworks and other proprietary Apple frameworks, are closed source.
2.8.5.2 Android¶
The Android operating system was created by the Open Handset Alliance, led by Google, for smartphones and tablets. It has two key philosophical differences from iOS:
- It is designed to run on a wide variety of hardware from different manufacturers.
- It is open-source, which has been a major factor in its widespread adoption.
The architecture of Android is a layered software stack, as shown in the diagram.
Go to Figure 2.18: Architecture of Google’s Android.
Let's break down this stack from the top down:
1. Applications¶
- This is the top layer, where all the user-facing apps reside (like the web browser, games, etc.).
2. Android Framework & ART (Android RunTime)¶
This layer is crucial for how apps are built and run.
- Development Language: Apps are primarily written in Java.
- Android API: However, developers do not use the standard Java API. Google provides a separate Android API for Java development.
- Android RunTime (ART): This is the engine that runs the applications. It is a virtual machine (VM), but it is specifically designed for Android and optimized for mobile devices with limited memory and CPU power.
- Compilation Process:
  - Java code is first compiled into standard Java bytecode (a `.class` file).
  - This bytecode is then translated into an Android-executable format called a `.dex` (Dalvik Executable) file.
- Ahead-of-Time (AOT) Compilation: This is a key performance feature. Unlike many Java VMs that use Just-In-Time (JIT) compilation (compiling code as it runs), ART compiles the `.dex` files into native machine code at the time of app installation. This allows for:
  - More efficient application execution.
  - Reduced power consumption.
  - Both are critical for mobile battery life.
3. Java Native Interface (JNI)¶
- This is a bridge that allows developers to write Java code that can bypass the ART virtual machine and call native libraries written in C/C++ directly.
- This is used to access specific hardware features or for performance-critical tasks.
- Important Caveat: Programs using JNI are generally not portable across different hardware devices, as they are tied to specific low-level code.
4. Native Libraries¶
- This layer contains core C/C++ libraries that provide essential services to the system and apps. Examples include:
- WebKit: For rendering web pages.
- OpenGL: For 2D and 3D graphics.
- SQLite: A lightweight database engine.
- SSL: For secure network communication.
- Surface Manager: For composing what different apps draw on the screen.
5. Hardware Abstraction Layer (HAL)¶
- This is a key layer for Android's hardware compatibility. Since Android runs on countless devices, the HAL provides a standard interface that abstracts the specific details of the physical hardware (like the camera, GPS, sensors).
- It allows app developers to talk to a consistent software interface without worrying about the underlying hardware driver. This ensures application portability across different devices.
6. Linux Kernel¶
- At the very bottom of the stack is the Linux kernel, which provides the core operating system services.
- However, Google has made significant modifications to the standard Linux kernel to tailor it for mobile needs:
- Power Management: Enhanced features to conserve battery.
- Memory Management: Optimized for memory-constrained devices.
- Interprocess Communication (IPC): Added a new, efficient form of IPC called Binder (covered in Section 3.8.2.1).
- Wakelocks: A mechanism to prevent the device from going to sleep while an app is performing a critical task.
A Special Note: The Bionic C Library¶
- Instead of using the standard GNU C Library (`glibc`) from Linux, Android uses its own library called Bionic.
- Reasons for this include:
  - Smaller Memory Footprint: Designed to be lighter and faster.
  - Optimized for Mobile CPUs: Tuned for the typically slower processors in mobile devices.
  - Licensing: Allows Google to use a more permissive software license than the GPL license used by `glibc`.
Windows Subsystem for Linux (WSL)¶
This section describes a feature in Windows 10 that allows it to run native Linux applications. It's a practical example of Windows's hybrid architecture, which uses subsystems to emulate different operating-system environments.
High-Level Overview¶
Windows 10 includes the Windows Subsystem for Linux (WSL). This subsystem allows native Linux applications (in the form of ELF binaries) to run directly on Windows, without the need for a virtual machine.
How a User Interacts with WSL:
A user starts the bash.exe application on Windows. This presents the user with a familiar bash shell, but it's running within the WSL environment.
Internal Architecture and Components¶
Internally, WSL works by creating a special environment. Here are the key components and how they work together:
Linux Instance:
- When `bash.exe` is run, WSL creates a Linux instance.
- This instance starts with the `init` process (the traditional first process in Linux), which then creates the `/bin/bash` shell process.
Pico Processes:
- Each of these Linux processes (like `init` and `bash`) runs inside a special Windows container called a Pico process.
- The Pico process's job is to load the native Linux binary into its address space, creating the environment where the Linux application can execute.
Kernel Services: LXCore and LXSS:
- These Pico processes communicate with two key Windows kernel services:
  - LXCore
  - LXSS (Linux Subsystem)
- The primary job of these services is to translate Linux system calls from the running Linux application into something the Windows kernel can understand and execute.
The System Call Translation Process¶
This is the core technical challenge that WSL solves. When a Linux application makes a system call, it must be handled by the Windows kernel. The translation happens as follows:
Case 1: Direct Equivalent Exists
- If the Linux system call has a direct, one-to-one equivalent in Windows, the LXSS service simply forwards the call directly to the corresponding Windows system call. This is the most efficient path.
Case 2: Similar, but Not Identical, System Calls
- Often, Linux and Windows have system calls that are similar but not exactly the same.
- In this situation, LXSS provides part of the functionality itself and then invokes the similar Windows system call to complete the operation.
Case 3: No Windows Equivalent
- If a Linux system call has no equivalent in Windows, the LXSS service must provide the entire functionality itself to emulate the behavior the Linux application expects.
A Concrete Example: The fork() System Call¶
The book uses the fork() call to illustrate Case 2 (similar but not identical calls).
- In Linux, `fork()` creates a child process that is an almost identical duplicate of the parent.
- In Windows, the closest equivalent is `CreateProcess()`, which is designed to create a new process running a different program; it is not built to duplicate an existing process.
How WSL handles fork():
- The LXSS service does the initial work required for forking.
- It then calls the Windows `CreateProcess()` system call to handle the remainder of the work.
- By combining its own logic with the native Windows call, it successfully emulates the Linux `fork()` behavior.
The figure provided illustrates this basic behavior of WSL, showing the flow from the user-mode bash.exe down through the Pico processes and the LXSS/LXCore services to the Windows kernel.
2.9 Building and Booting an Operating System¶
This section explains how an operating system (OS) is created, configured, and started on a computer. While an OS can be built for one specific machine, it is more common to design it to run on a variety of machines with different hardware components.
2.9.1 Operating-System Generation¶
When you buy a computer, it usually comes with an OS like Windows or macOS already installed. However, there are situations where you need to install or build an OS yourself, such as:
- Replacing the pre-installed OS.
- Adding another OS (dual-booting).
- Using a computer that was sold without an OS.
If you are building an operating system from the beginning, you must follow these five steps:
- Write (or Obtain) the Source Code: You need the human-readable code that defines the OS. This can be code you write yourself or existing code, like the Linux kernel.
- Configure the OS: You must adapt the OS for the specific hardware it will run on. Settings are saved in a configuration file.
- Compile the OS: The human-readable source code is translated into machine code that the computer's processor can execute.
- Install the OS: The compiled OS is placed onto the computer's hard drive or storage device.
- Boot the Computer: The computer is started, and control is handed over to the newly installed operating system.
Levels of System Configuration and Generation
The process of system generation can be done at different levels of customization:
Full System Build (Most Tailored):
- A system administrator modifies the OS source code directly based on the configuration file.
- The entire operating system is then compiled from this modified source code.
- The result is a completely customized OS built specifically for that exact hardware configuration. This is common for embedded systems with static, unchanging hardware.
Linking Precompiled Modules (Common Compromise):
- The configuration file is used to select necessary parts from a library of precompiled object modules (like device drivers for different I/O devices).
- These pre-made modules are then linked together to form the final OS.
- This is faster than a full compile but may result in a larger, more general system than strictly necessary. Most modern OSs for desktops, laptops, and mobile devices use this approach, often with loadable kernel modules to add support for new hardware dynamically.
Modular System (Most Flexible):
- The system is built from modules that are selected and loaded at execution (run) time.
- System generation here simply involves setting configuration parameters. The system adapts as it runs.
The main differences between these approaches are the size of the final OS, its generality, and how easy it is to change when the hardware changes.
Building a Linux System from Scratch: A Step-by-Step Example
To illustrate a full system build, here is a typical process for building a Linux kernel:
- Download Source Code: Get the Linux source code from a site like http://www.kernel.org.
- Configure the Kernel: Run the command `make menuconfig`. This command opens a menu to select which features and drivers to include. Your choices are saved into a file named `.config`.
- Compile the Kernel: Run the `make` command. This command reads the `.config` file and compiles the core part of the operating system, producing the kernel image file named `vmlinuz`.
- Compile Kernel Modules: Run the `make modules` command. This compiles the optional, loadable parts of the kernel (the modules), also based on the `.config` file.
- Install the Modules: Run `make modules_install`. This command copies the compiled modules into the correct location so the `vmlinuz` kernel can find and use them.
- Install the New Kernel: Run `make install`. This command installs the new `vmlinuz` kernel and related files onto the system. When the computer reboots, it will run this new operating system.
Alternative: Using a Virtual Machine
Instead of replacing your main OS, you can install Linux inside a virtual machine (VM). A VM allows you to run an operating system (the "guest") on top of your existing OS (the "host"), such as running Linux on a Windows or macOS computer. (Refer to Section 1.7 and Chapter 18 for more on virtualization).
There are two main ways to do this:
- Build from Scratch in a VM: The process is similar to the steps above, but you perform them within the virtual machine's environment.
- Use a Pre-built Virtual Machine Appliance: This is a faster method. You download a ready-to-run OS image (an appliance) and install it using virtualization software like VirtualBox or VMware.
Example of using a pre-built appliance: The authors of the textbook created their provided VM by:
- Downloading an Ubuntu ISO image (a disk image file) from https://www.ubuntu.com/.
- Configuring VirtualBox to use the downloaded ISO file as a virtual bootable DVD.
- Starting (booting) the virtual machine, answering the installation questions, and letting the installer set up the OS inside the VM.
2.9.2 System Boot¶
This section explains how the computer starts up and loads the operating system kernel into memory after it has been generated and installed. This process is known as booting.
The Boot Process Overview¶
Once the operating system is built, the hardware needs to find and start it. The boot process typically follows these steps:
- Bootstrap Program: A small program called the bootstrap program or boot loader finds the kernel on the storage device.
- Load Kernel: The boot loader loads the kernel into the computer's main memory (RAM) and starts its execution.
- Initialize Hardware: The kernel takes over and initializes all the hardware components of the computer.
- Mount Root File System: The kernel mounts the root file system, which is the primary directory that contains all the system files and directories.
Detailed Boot Process: BIOS vs. UEFI¶
Some systems use a multistage boot process:
Traditional BIOS (Basic Input/Output System) Boot:
- When the computer powers on, a very small boot loader located in non-volatile firmware called the BIOS runs first.
- This initial loader usually just loads a second, more capable boot loader from a fixed location on the disk called the boot block.
- The code in the boot block is often simple because it must fit in a single disk block. Its main job is to know where to find the rest of the bootstrap program and the operating system kernel.
Modern UEFI (Unified Extensible Firmware Interface) Boot:
- Many newer systems have replaced BIOS with UEFI.
- UEFI has advantages over BIOS, including better support for 64-bit systems and larger disks.
- A key advantage is that UEFI is a single, complete boot manager, making the boot process faster than the multi-stage BIOS approach.
The Role of the Bootstrap Program¶
Whether using BIOS or UEFI, the main bootstrap program (like GRUB for Linux) performs several critical tasks:
- Loads the Kernel: It finds the file containing the kernel program and loads it into memory.
- Runs Diagnostics: It checks the state of the machine (e.g., inspecting memory, the CPU, and discovering connected devices).
- Initializes the System: It initializes CPU registers, device controllers, and the contents of main memory.
- Starts the OS: It finally starts the operating system's execution and mounts the root file system. Only after this point is the system considered to be running.
GRUB Example:
GRUB is a common, open-source bootstrap program. Its configuration file, loaded at startup, specifies boot parameters. For example, parameters from a Linux file called /proc/cmdline might look like this:
BOOT_IMAGE=/boot/vmlinuz-4.4.0-59-generic root=UUID=5f2e2232-4e47-4fe8-ae94-45ea749a5c92
- `BOOT_IMAGE` specifies the name of the kernel image file to be loaded into memory.
- `root` specifies the unique identifier (UUID) of the storage partition that contains the root file system.
The Linux Boot Process in More Detail¶
The Linux boot process involves some specific steps to be efficient:
- Compressed Kernel: The Linux kernel image is stored as a compressed file to save space. It is decompressed after being loaded into memory by the boot loader.
- Initial RAM File System (initramfs): The boot loader creates a temporary root file system in RAM called initramfs. This temporary file system contains essential drivers and kernel modules needed to access the real root file system (which is on a physical disk, not in memory).
- Switching Root File Systems: Once the kernel has started and the necessary drivers from `initramfs` are installed, the kernel switches from the temporary RAM file system to the real root file system on the disk.
- Starting Services: Finally, Linux creates the first process, `systemd`, which then starts all other system services (like a web server or database). The system then presents the user with a login prompt.
Note: The boot mechanism is tied to the boot loader. This means there are specific versions of boot loaders like GRUB for BIOS and for UEFI, and the firmware (BIOS/UEFI) must know which one to use.
Booting on Mobile Systems (e.g., Android)¶
The boot process for mobile systems like Android is slightly different from traditional PCs:
- Boot Loader: Android, though Linux-based, does not use GRUB. Vendors provide their own boot loaders, with LK ("little kernel") being a common one.
- Kernel and initramfs: Android uses the same compressed kernel image and initial RAM file system as Linux.
- Permanent initramfs: A key difference is that Android keeps the `initramfs` as its permanent root file system, whereas Linux discards it after use.
- Startup: After loading the kernel and mounting the root file system (`initramfs`), Android starts the `init` process, creates several services, and then displays the device's home screen.
Recovery and Diagnostic Booting¶
Most operating systems (Windows, Linux, macOS, iOS, Android) provide a way to boot into a special recovery mode or single-user mode. This mode is used for:
- Diagnosing hardware issues.
- Fixing corrupt file systems.
- Reinstalling the operating system.
This capability is crucial for troubleshooting problems that prevent the system from booting normally. The following section (2.10) will cover other system issues like software errors and performance problems.
2.10 Operating-System Debugging¶
Debugging is the process of finding and fixing errors in a system, including both hardware and software. It also includes addressing performance problems, which are considered a type of bug. Improving performance by identifying and removing bottlenecks is known as performance tuning. This section focuses on debugging errors in processes and the operating system kernel, as well as performance issues. Hardware debugging is not covered here.
2.10.1 Failure Analysis¶
When a software component fails, the operating system has methods to capture information about the failure to aid in debugging.
Debugging User Processes:
- Log Files: If a user process fails, most operating systems write error information to a log file. This alerts system administrators or users about the problem.
- Core Dump: The operating system can also capture the contents of the process's memory at the time of failure. This capture is called a core dump (a term from early computing where memory was referred to as "core") and is stored in a file.
- Debugger: Both running programs and saved core dumps can be analyzed using a debugger. This tool allows a programmer to examine the code and memory state of the process at the exact moment it failed, helping to identify the root cause.
Debugging the Operating System Kernel:
Debugging the kernel is significantly more complex than debugging a user process due to:
- The kernel's large size and complexity.
- Its direct control over the hardware.
- The fact that standard user-level debugging tools are not available or cannot be used.
A kernel failure is known as a crash. The procedure for handling a kernel crash is as follows:
- Log Information: Error information is saved to a log file.
- Crash Dump: The state of the system's memory is saved to a crash dump.
The Challenge of Kernel Crash Dumps: Saving a kernel crash dump is risky. If the crash is due to a failure in the file-system code, the kernel cannot safely save the memory state to a regular file on that potentially corrupted file system.
The solution to this problem is a special technique:
- A section of the disk is set aside that contains no file system. This area is reserved specifically for saving crash dumps.
- When the kernel detects an unrecoverable error, it writes the entire contents of memory (or at least the parts owned by the kernel) directly to this reserved disk area.
- When the system reboots, a special process runs. This process gathers the data from the reserved disk area and writes it to a proper crash dump file within a file system, where it can be safely analyzed.
This complex strategy is necessary for kernel debugging but is not required for debugging ordinary user-level processes.
2.10.2 Performance Monitoring and Tuning¶
Performance tuning aims to improve system performance by finding and eliminating processing bottlenecks. To find these bottlenecks, you must first be able to monitor how the system is performing. Therefore, the operating system must provide ways to measure and display system behavior.
Performance monitoring tools can be characterized in two ways:
- By Scope: They can provide per-process observations or system-wide observations.
- By Method: They use one of two main approaches: counters or tracing.
2.10.2.1 Counters¶
Operating systems track activity using a series of counters. These counters keep a running tally of events, such as the number of system calls made, the number of disk operations performed, or the number of network packets sent.
Here are examples of Linux tools that use counters, categorized by their scope:
Per-Process Tools (Focus on individual programs):
- `ps`: Reports static information for a single process or a selected group of processes (e.g., their ID, status, and resource usage).
- `top`: Provides a real-time, continuously updated view of system statistics, with a focus on current processes and their consumption of resources like CPU and memory.
System-Wide Tools (Focus on the entire system):
- `vmstat`: Reports virtual memory statistics, providing data about memory, paging, and block I/O.
- `netstat`: Reports statistics for network interfaces and connections.
- `iostat`: Reports I/O (Input/Output) usage and statistics for disks and other storage devices.
The /proc File System in Linux
Most counter-based tools on Linux systems get their data from the /proc file system.
- `/proc` is a "pseudo" file system; it does not exist on a physical disk. It is created dynamically by the kernel in memory when the system boots.
- Its primary purpose is to provide an interface for querying various per-process and kernel statistics.
- It is organized as a directory hierarchy. Each running process has a subdirectory named after its unique Process ID (PID). For example, the directory `/proc/2155` contains all the statistical information for the process with the ID 2155.
- There are also entries in `/proc` for various kernel statistics.
(Refer to Figure 2.19 for the Windows 10 task manager).
Windows Performance Monitoring
Windows systems provide the Windows Task Manager. This tool offers information on:
- Current applications
- Running processes
- CPU usage
- Memory usage
- Networking statistics
The Windows Task Manager is a graphical tool that provides counter-based monitoring similar to the Linux command-line tools.
2.10.3 Tracing¶
Tracing is a different approach from using counters. While counter-based tools simply check the current value of statistics the kernel keeps, tracing tools collect detailed data about specific events as they happen. This allows you to see the step-by-step execution, for example, of a system-call invocation.
Here are examples of Linux tracing tools, categorized by their scope:
Per-Process Tools (Focus on individual programs):
- strace: Traces the system calls made by a process. It shows each call to the kernel and its result.
- gdb: The GNU Debugger. This is a source-level debugger that allows a programmer to step through a program's execution line by line, examine variables, and analyze its logic.
System-Wide Tools (Focus on the entire system):
- perf: A comprehensive collection of Linux performance tools that can trace a wide variety of CPU and system-related events.
- tcpdump: A powerful command-line tool that captures and analyzes network packets passing through the system.
Kernighan’s Law
The text highlights a famous principle in computer science: “Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.”
This law emphasizes that writing clear and understandable code is crucial. Overly complex or "clever" code becomes extremely difficult to fix when problems arise.
Modern Tracing Tools
Making operating systems easier to understand, debug, and tune is an active area of development. A new generation of tools has made significant progress. One such toolkit is BCC (BPF Compiler Collection), which is used for dynamic kernel tracing in Linux. It allows for sophisticated and efficient tracing and analysis of the kernel's behavior in real-time.
2.10.4 BCC¶
Debugging the interactions between user-level code and kernel code is extremely difficult without a specialized toolset. For a toolset to be truly useful for this task, it must meet several demanding requirements:
- It must be able to debug any area of the system, including parts not originally written with debugging in mind.
- It must perform this debugging without affecting system reliability.
- It must have a minimal performance impact. Ideally, it should have no impact when not in use, and the impact during use should be proportional to the amount of data being collected.
The BCC toolkit is designed to meet these requirements, providing a dynamic, secure, and low-impact debugging environment.
What is BCC?¶
BCC (BPF Compiler Collection) is a rich toolkit that provides advanced tracing features for Linux systems. BCC acts as a front-end interface to a powerful underlying technology called eBPF (extended Berkeley Packet Filter).
The Foundation: eBPF
- The original BPF (Berkeley Packet Filter) technology was developed in the early 1990s for filtering network traffic.
- eBPF (extended BPF) is a modern version that added many new features, transforming it from a simple packet filter into a general-purpose, in-kernel virtual machine.
- Programs for eBPF are written in a restricted subset of the C programming language and are compiled into special eBPF bytecode instructions.
- This eBPF bytecode can be dynamically inserted into a running Linux kernel without needing to reboot or load kernel modules.
- eBPF instructions can be used to capture specific events (like a particular system call being invoked) or to monitor system performance (like measuring the time required for disk I/O).
- To ensure safety and stability, all eBPF programs are passed through a verifier before they are inserted into the kernel. This verifier checks the code to ensure it will not crash the kernel, create infinite loops, or compromise system security.
How BCC Simplifies eBPF¶
Although eBPF is powerful, writing programs directly using its C interface has traditionally been very difficult. BCC was created to solve this problem by providing a simpler, higher-level front-end interface in Python.
The structure of a BCC tool is as follows:
- The main tool is written in Python.
- This Python code embeds C code that interfaces with the eBPF instrumentation.
- The embedded C code, in turn, interfaces directly with the kernel.
- The BCC tool automatically compiles the C program into eBPF instructions and inserts it into the kernel using techniques like probes or tracepoints, which are methods for hooking into specific events in the Linux kernel.
Using Pre-Built BCC Tools¶
Writing custom BCC tools is complex, but the BCC package (which is installed on the provided Linux virtual machine) comes with many ready-to-use tools that monitor various areas of a running kernel.
Example: disksnoop.py
This tool traces disk I/O activity. Entering the command ./disksnoop.py generates output like this:
TIME(s) T BYTES LAT(ms)
1946.29186700 R 8 0.27
1946.33965000 R 8 0.26
1948.34585000 R 8192 0.96
1950.43251000 W 4096 0.56
1951.74121000 R 4096 0.35
This output provides detailed information for each I/O operation:
- TIME(s): The timestamp when the operation occurred.
- T: The type of operation (Read or Write).
- BYTES: The number of bytes involved in the I/O operation.
- LAT(ms): The duration, or latency, of the I/O operation in milliseconds.
Targeted Tracing
BCC tools can be focused on specific applications or processes. For example, the command ./opensnoop -p 1225 will trace only the open() system calls made by the process with the Process ID 1225. Many BCC tools are also designed for specific applications like MySQL databases, Java, and Python programs.
The Power of BCC¶
What makes BCC especially powerful is that its tools can be safely used on live production systems that are running critical applications. This allows system administrators to monitor system performance in real-time to identify bottlenecks or security exploits without causing harm to the system.
(Refer to Figure 2.20 for an illustration of the wide range of tools provided by BCC and eBPF, showing their ability to trace essentially any area of the Linux operating system.)
BCC is a rapidly evolving technology, with new tools and features being added constantly.
2.11 Summary¶
An operating system provides an environment for program execution by offering services to both users and programs.
There are three primary ways to interact with an operating system:
- Command Interpreters (CLI - Command-Line Interface)
- Graphical User Interfaces (GUI)
- Touch-Screen Interfaces
System calls provide the interface to the services made available by the operating system. Programmers use a system call's Application Programming Interface (API) to access these services.
System calls can be divided into six major categories:
- Process Control
- File Management
- Device Management
- Information Maintenance
- Communications
- Protection
The standard C library provides the system-call interface for UNIX and Linux systems.
Operating systems include a collection of system programs that provide utilities to users.
A linker combines several relocatable object modules into a single binary executable file. A loader then loads this executable file into memory, where it becomes eligible to run on an available CPU.
Applications are operating-system specific for several reasons, including:
- Different binary formats for program executables.
- Different instruction sets for different CPUs.
- System calls that vary from one operating system to another.
An operating system is designed with specific goals in mind. These goals determine the operating system's policies, which are implemented through specific mechanisms.
A monolithic operating system has no internal structure; all functionality is provided in a single, static binary file that runs in a single address space. Its primary benefit is efficiency, but it is difficult to modify.
A layered operating system is divided into a number of discrete layers. The bottom layer is the hardware interface, and the highest layer is the user interface. This approach is generally not ideal for operating systems due to performance problems.
The microkernel approach uses a minimal kernel, with most services running as user-level applications. Communication takes place via message passing.
A modular approach provides operating-system services through modules that can be loaded and removed during run time. Many contemporary operating systems are hybrid systems, using a combination of a monolithic kernel and modules.
A boot loader loads the operating system into memory, performs initialization, and begins system execution.
Operating system performance can be monitored using two main methods:
- Counters: A collection of system-wide or per-process statistics.
- Tracing: Following the execution of a program through the operating system to see each step.
Chapter 3: Processes¶
To understand why processes are so important, we need to look at how computer systems evolved.
From Single-Tasking to Multitasking¶
Early computers were very simple. They allowed only one program to be executed at a time. When that program ran, it had complete control over the entire system—the CPU, memory, and all input/output devices. There was no operating system in the way we think of it today to manage resources or provide protection.
Contemporary computer systems are completely different. They allow multiple programs to be loaded into memory and executed concurrently. This shift from single-tasking to multitasking created a major problem: how do you stop these programs from interfering with each other? What if one program crashes? Should it take down the entire system? What if a program has an infinite loop? Should it monopolize the CPU forever?
The Process as the Solution¶
The answer to these problems was the creation of the process. The need for "firmer control and more compartmentalization" of programs led directly to this concept. By defining each running program as a distinct process, the operating system can:
- Isolate them from one another.
- Manage the resources (CPU time, memory) allocated to each one.
- Protect the system and other programs from any single misbehaving program.
This is why a process is called the unit of work in a modern computing system. The OS doesn't manage "programs"; it manages "processes."
System Processes and User Processes¶
A modern operating system is more than just a program launcher. It has many internal jobs to do, like managing memory, scheduling tasks, and handling network traffic. The text makes a key distinction:
- User Processes: These are the processes executing your code, like your web browser or word processor.
- System Processes: These are processes that execute operating system code. However, for stability and design reasons, not all OS code runs in the privileged kernel mode. Much of it runs as separate processes in user space.
Therefore, a running system is actually a collection of processes—a mix of user applications and system utilities—all working together.
The Illusion of Concurrency¶
The text states that "all these processes can execute concurrently, with the CPU (or CPUs) multiplexed among them." Let's break down what this means for your computer architecture background:
- On a single-core CPU, only one process can actually be executing an instruction at any single moment. The CPU is rapidly switched between all the active processes. This is called time-sharing or multiplexing. Each process gets a small slice of CPU time (a few milliseconds), making it appear as if all processes are running simultaneously.
- On a multi-core CPU, true parallelism is possible, where multiple processes actually run at the exact same time, each on its own core.
In both scenarios, the operating system's process scheduler is the component responsible for deciding which process runs next on which CPU core. This will be covered in more detail later in the chapter.
3.1 Process Concept¶
3.1.1 What is a Process?¶
Think about everything your computer is doing right now. You might have a web browser, a music player, and a code editor all running at the same time. From the operating system's perspective, each of these running applications is a process.
Historically, the term "job" was used, especially in early batch systems. While "process" is the more modern and precise term, you will still encounter "job" in contexts like "job scheduling" because it's deeply rooted in operating system history and theory.
The Core Definition: A process is a program in execution. It's a live, active entity. To understand what a process is, we need to look at what makes it up.
The Components of a Process in Memory¶
A process isn't just the code; it's the entire context needed to run that code. When a program is loaded into memory to become a process, its memory layout is divided into several sections. Go to Figure 3.1 to see the typical layout of a process in memory.
Let's break down each section from the figure:
- Text Section: This is the actual executable code of the program. It's the machine instructions that the CPU follows. This section is read-only to prevent a program from accidentally modifying its own instructions.
- Data Section: This section holds global variables. These are variables that are defined outside of any function and exist for the lifetime of the program.
- Heap Section: This is memory that is dynamically allocated during program run time. When you use calls like malloc() in C or new in C++/Java, you are requesting memory from the heap. The heap grows upwards (towards higher memory addresses).
- Stack Section: The stack is used for temporary data storage when invoking functions. Each time a function is called, an activation record (or stack frame) is pushed onto the stack. This frame contains:
  - Function parameters
  - Return addresses (where to go back to when the function is done)
  - Local variables

  When the function returns, its activation record is popped off the stack. The stack grows downwards (towards lower memory addresses).
Dynamic Behavior: The sizes of the text and data sections are fixed when the program starts. However, the stack and heap can shrink and grow dynamically during execution. As you call more functions, the stack grows. As you allocate more memory, the heap grows. The operating system must ensure that these two sections never collide as they grow towards each other.
Crucial Distinction: Program vs. Process¶
This is a fundamental concept:
- A program is a passive entity. It's just a file on your disk (an executable file) containing a list of instructions. It does nothing by itself.
- A process is an active entity. It has a program counter (which tells it the next instruction to execute) and a set of associated resources (like memory, CPU registers, and open files).
A program becomes a process when that executable file is loaded into memory. This happens when you double-click its icon or run it from the command line (e.g., a.out).
Multiple Processes from One Program¶
It's possible to have multiple processes that are all instances of the same program. For example, if you open three different terminal windows, you have three separate processes running the shell program. Even though the text section (the executable code) is identical for all of them, each process has its own, separate data, heap, and stack sections. This is why one crashed web browser tab doesn't necessarily crash the entire browser—each tab is often a separate process.
Processes as Execution Environments¶
A process can itself be a platform for running other code. A perfect example is the Java Virtual Machine (JVM).
When you run java Program, the operating system creates a process for the JVM. This JVM process then loads your Program.class file and interprets its instructions, executing them on your behalf using native machine instructions. The JVM process is the real process the OS manages, and your Java program runs within this controlled environment.
3.1.2 Process State¶
A process is not a static entity; it is dynamic. As it executes, its current activity changes, and we describe this using its state. The state of a process is defined by what it is currently doing. A process can be in one of several distinct states.
The Five Process States¶
Here are the five fundamental states a process can be in:
- New: The process is in the middle of being created. The operating system is setting up its process control block (we'll cover this later) and loading its program into memory.
- Running: Instructions are being executed on a CPU core. This is the state where the process is actively using the processor.
- Waiting (or Blocked): The process cannot proceed because it is waiting for an external event to occur. Common examples include:
- Waiting for user input (like a keyboard press).
- Waiting for data to be read from or written to a disk (I/O operation).
- Waiting for a signal from another process.
- Ready: The process is loaded in memory and is capable of executing, but the operating system has not assigned it to a CPU core yet. It is patiently waiting for its turn to run. There is often a ready queue where all ready processes reside.
- Terminated: The process has finished executing. The operating system is in the process of cleaning up its resources (freeing memory, etc.).
Crucial Point: On a system with a single CPU core, only one process can be in the "Running" state at any given instant. However, many processes can simultaneously be in the "Ready" and "Waiting" states.
The Process State Diagram¶
The transitions between these states are best visualized with a diagram. Go to Figure 3.2 to see the process state transition diagram. Let's walk through what each arrow means:
- New -> Ready: When the OS finishes creating the process, it moves it from "New" to the "Ready" queue. This transition is called admit.
- Ready -> Running: When the operating system's scheduler decides it's time for a process to run, it selects a process from the "Ready" state and assigns it to a CPU. This transition is called dispatch.
- Running -> Ready: A running process can be forced back to the "Ready" state for two main reasons, often related to interrupts:
- A timer interrupt occurs, meaning the process has used up its maximum allowed time slice (its "turn" on the CPU). This prevents one process from hogging the CPU.
- A higher-priority process becomes ready and preempts the current one.
- Running -> Waiting: A process voluntarily moves itself to the "Waiting" state when it requests something that takes time and doesn't need the CPU, like an I/O operation (e.g., reading a file). The process is said to issue a wait for that event.
- Waiting -> Ready: When the event the process was waiting for finally occurs (e.g., the disk read completes), the process moves from "Waiting" back to the "Ready" queue. This is an I/O or event completion.
- Running -> Terminated: A process moves to the "Terminated" state when it finishes executing its final instruction (voluntarily) or is forcibly killed (involuntarily). It then exits.
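The transitions above can be summarized in code. This is a teaching sketch (the names are illustrative, not from any real kernel): a C enum for the five states plus a helper that encodes exactly the arrows in the state diagram:

```c
#include <stdbool.h>

enum proc_state { NEW, READY, RUNNING, WAITING, TERMINATED };

/* Returns true if moving from `from` to `to` is a legal transition
   in the five-state model. */
bool legal_transition(enum proc_state from, enum proc_state to) {
    switch (from) {
    case NEW:     return to == READY;        /* admit */
    case READY:   return to == RUNNING;      /* dispatch */
    case RUNNING: return to == READY         /* timer interrupt / preemption */
                      || to == WAITING       /* I/O or event wait */
                      || to == TERMINATED;   /* exit */
    case WAITING: return to == READY;        /* I/O or event completion */
    default:      return false;              /* no transitions out of TERMINATED */
    }
}
```

Note that there is no WAITING -> RUNNING arrow: a waiting process must first rejoin the ready queue and be dispatched again.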
Memory Layout of a C Program (A Closer Look)¶
The previous section gave a general overview of a process's memory layout. Now, let's see how this maps directly to a real C program. The text provides a specific diagram for a C program's memory layout, which refines the general picture from Figure 3.1.
The key differences for a C program are:
- The data section is often split into two parts:
  - Initialized Data: Contains global variables that were given a starting value (e.g., int y = 15;).
  - Uninitialized Data (often called bss): Contains global variables that were not explicitly initialized (e.g., int x;). The OS initializes these to zero.
- The stack also includes space for the command-line arguments argc and argv passed to the main() function.
Using the size Command:
You can inspect the sizes of these sections in a compiled program using the size command. For a program named memory, running size memory might output:
text data bss dec hex filename
1158 284 8 1450 5aa memory
- text: The size of the code section (1158 bytes).
- data: The size of the initialized data section (284 bytes).
- bss: The size of the uninitialized data section (8 bytes). The term bss is a historical acronym for "block started by symbol."
- dec/hex: The sum of the three sections in decimal (1450) and hexadecimal (5aa).
3.1.3 Process Control Block¶
Now that we know what a process is and the states it can be in, a critical question arises: How does the operating system keep track of all this information for every single process? The answer is the Process Control Block (PCB), which is sometimes also called a task control block.
Think of the PCB as a process's resume or ID card within the operating system. Each process has exactly one PCB, and the OS uses these PCBs to manage and control all processes. Go to Figure 3.3 to see a representation of a PCB.
The PCB is a data structure that contains all the information the OS needs to know about a specific process. Let's break down its essential components:
Components of the Process Control Block¶
Process State: This field stores the current state of the process (e.g., running, ready, waiting, etc.), as defined in the previous section.
Program Counter: This is a crucial piece of information. The PC holds the memory address of the next instruction to be executed by this process. When the process is switched out and later resumed, the OS uses this saved address to know where to continue from.
CPU Registers: This is a set of all the processor registers that the process was using (e.g., accumulators, index registers, stack pointers, general-purpose registers). Like the program counter, these must be saved when the OS interrupts a process so that its entire CPU context can be restored perfectly when it runs again. Together, the Program Counter and CPU Registers form the hardware state of the process.
CPU-Scheduling Information: This is the data the OS scheduler uses to decide which process runs next. It includes:
- Process priority
- Pointers to link this PCB to various scheduling queues (like the ready queue)
- Any other scheduling parameters (we will cover these in detail in Chapter 5).
Memory-Management Information: This section describes the layout of the process in memory. Its contents depend on the memory system used by the OS (covered in Chapter 9), but it can include:
- The values of base and limit registers (in simple systems).
- Pointers to page tables or segment tables (in more advanced virtual memory systems).
Accounting Information: This is a collection of data for resource usage tracking and billing. It can include:
- The amount of CPU time the process has used.
- The total real (wall-clock) time since it started.
- Time limits.
- Job or process numbers.
- Account numbers (for multi-user systems).
I/O Status Information: This section lists all the resources the process is using for input and output. It includes:
- The list of I/O devices allocated to the process (e.g., a specific printer).
- A list of the process's open files.
The Role of the PCB¶
In summary, the PCB is the repository for all the data needed to start, stop, and restart a process. When the OS performs a context switch (changing the CPU from one process to another), it must perform these key steps:
- Save the state of the current running process (its Program Counter and CPU Registers) into its PCB.
- Load the state of the new process (from its PCB) into the CPU registers and set the program counter.
This mechanism allows the OS to give the illusion of concurrent execution by transparently switching between processes, using their PCBs as bookmarks.
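The PCB fields described above can be sketched as a C structure. This is an illustrative teaching sketch only; a real PCB (such as Linux's task_struct, discussed later) contains far more fields:

```c
#include <stdint.h>

enum proc_state { NEW, READY, RUNNING, WAITING, TERMINATED };

struct pcb {
    int             pid;              /* process identifier */
    enum proc_state state;            /* current process state */

    /* Hardware state, saved on a context switch: */
    uint64_t        program_counter;  /* address of the next instruction */
    uint64_t        registers[16];    /* saved CPU register contents */

    int             priority;         /* CPU-scheduling information */
    void           *page_table;       /* memory-management information */
    uint64_t        cpu_time_used;    /* accounting information */
    int             open_files[16];   /* I/O status: open file descriptors */

    struct pcb     *next;             /* link for scheduling queues */
};
```

On a context switch, the OS copies the CPU's actual registers into the outgoing process's program_counter and registers fields, then loads the CPU from the incoming process's PCB.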
3.1.4 Threads¶
The traditional process model we've discussed so far assumes that a process is a single, sequential flow of execution, known as a thread. Think of it as a single path of instructions that the CPU follows.
The Limitation of a Single Thread¶
In this model, a process can only do one thing at a time. For example, if a word processor is a single-threaded process, it can either:
- Wait for you to type characters, or
- Run the spell checker.
It cannot do both simultaneously. If it starts the spell checker, your user interface might freeze until it completes. This is a significant limitation for modern, interactive applications.
The Solution: Multithreading¶
Most modern operating systems extend the process concept to allow a single process to contain multiple threads of execution. A thread (sometimes called a lightweight process) is a basic unit of CPU utilization within a process. Each thread has its own:
- Thread ID
- Program Counter
- Register set
- Stack
However, all threads within a single process share the same code, data, and other OS resources, such as open files and memory space.
This allows a process to perform more than one task at the same time. A multithreaded word processor could, for instance:
- Assign one thread to manage user input (keystrokes and mouse clicks).
- Assign a second thread to run the spell checker in the background.
- Assign a third thread to handle auto-saving documents at regular intervals.
This concurrency keeps the user interface responsive even while background tasks are running.
Threads and Hardware¶
This feature is especially powerful on multicore systems, where the operating system can assign different threads from the same process to different CPU cores, allowing them to run truly in parallel. Even on a single-core CPU, multithreading provides benefits by allowing the CPU to switch between tasks quickly, creating the illusion of simultaneity.
Impact on the Process Control Block (PCB)¶
On systems that support threads, the PCB structure we discussed earlier must be expanded. Instead of having just one set of CPU registers and one program counter, the PCB now needs to manage multiple such sets—one for each thread belonging to the process. The PCB becomes a container for the process's shared resources, while also holding or pointing to the individual Thread Control Blocks (TCBs) for each thread.
This requires significant changes throughout the operating system to manage and schedule threads efficiently. Chapter 4 will explore threads in much greater detail.
3.2 Process Scheduling¶
The core idea behind modern operating systems is to keep the CPU as busy as possible. Two key concepts enable this:
- Multiprogramming: The objective is to always have some process running at all times to maximize CPU utilization. When one process needs to wait for I/O (like a disk read), the OS switches to another process that is ready to run, so the CPU is never idle.
- Time Sharing: The objective is to switch the CPU core among processes so frequently that users can interact with each program while it is running. This creates the illusion that multiple programs are executing simultaneously, providing responsiveness.
The component of the operating system that makes this possible is the process scheduler. Its job is to select an available process from a set of ready processes and assign it to a CPU core for execution.
A fundamental rule is: Each CPU core can run only one process at a time.
Process Scheduling Queues¶
To manage which process runs next, the OS uses queues. The two most important queues are illustrated in Figure 3.4.
- The Ready Queue: This is a queue of all processes that are in the ready state (loaded in memory and ready to execute, just waiting for a CPU core). The PCBs of these processes are typically linked together in this queue. The scheduler picks the next process to run from the head of this queue.
- Wait Queues (or Device Queues): When a process issues an I/O request, it is moved from the running state to a waiting state and its PCB is placed on a wait queue for that specific device (e.g., a disk queue, a keyboard queue). When the I/O operation completes, the process is moved back to the ready queue.
Degree of Multiprogramming¶
The number of processes currently residing in memory (and thus capable of being scheduled) is known as the degree of multiprogramming.
- On a single-core CPU, only one process can be running at any instant, but multiple processes can be in the ready and waiting states.
- On a multicore system, the number of running processes can be up to the number of cores.
- If there are more ready processes than available cores, the excess processes must wait in the ready queue until a core is free.
Process Behavior: I/O-bound vs. CPU-bound¶
An effective scheduler must also consider the general behavior of a process, which generally falls into one of two categories:
- I/O-bound Process: This type of process spends more of its time doing I/O operations than computations. These processes typically run for only short bursts before they need to wait for an I/O request to complete (e.g., a text editor waiting for user input, a web server waiting for a network packet). I/O-bound processes typically have many short CPU bursts.
- CPU-bound Process: This type of process generates I/O requests infrequently and spends more of its time performing computations (e.g., a scientific calculation, a complex video encoding task). CPU-bound processes typically have few, but very long, CPU bursts.
A good scheduling system maintains a balance: it must give quick, responsive service to I/O-bound processes to keep them from waiting too long, while also ensuring that CPU-bound processes make steady progress.
Process Representation in Linux¶
In the Linux operating system, the Process Control Block (PCB) is implemented as a C structure named task_struct. You can find this defined in the kernel source code in the file <include/linux/sched.h>. This single structure contains all the information the kernel needs to manage a process.
Some of the key fields inside the task_struct include:
- long state; // stores the current state of the process (running, waiting, etc.)
- struct sched_entity se; // contains scheduling information for this process
- struct task_struct *parent; // a pointer to the PCB of the process that created this one
- struct list_head children; // a list of the PCBs of processes created by this one
- struct files_struct *files; // a list of open files for this process
- struct mm_struct *mm; // memory management information (like page tables)
The kernel keeps track of all active processes by linking their task_struct PCBs together in a doubly linked list. It also maintains a special pointer, current, which points to the task_struct of the process currently executing on the CPU.
For example, if the kernel needs to change the state of the currently running process, it can simply execute:
current->state = new_state;
3.2.1 Scheduling Queues¶
To manage the many processes in various states, the operating system uses a series of queues. Think of these as waiting lines for different resources.
The Ready Queue¶
As processes enter the system, they are placed into the ready queue. This queue holds all processes that are loaded in memory and ready to execute, but are waiting for their turn on a CPU core.
- Implementation: This queue is typically stored as a linked list of Process Control Blocks (PCBs).
- Structure: There is a ready-queue header that contains a pointer to the first PCB in the list. Each PCB in the list has a pointer field that points to the next PCB in the ready queue.
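The ready-queue structure described above can be sketched as a simple FIFO linked list of PCBs (a teaching sketch; real schedulers use more elaborate structures, such as Linux's red-black tree of sched_entity nodes):

```c
#include <stddef.h>

struct pcb {
    int pid;
    struct pcb *next;   /* pointer to the next PCB in the ready queue */
};

struct ready_queue {
    struct pcb *head;   /* the ready-queue header points at the first PCB */
    struct pcb *tail;
};

/* A process that becomes ready is linked onto the tail of the queue. */
void enqueue(struct ready_queue *q, struct pcb *p) {
    p->next = NULL;
    if (q->tail) q->tail->next = p;
    else         q->head = p;
    q->tail = p;
}

/* The scheduler dispatches the process at the head of the queue. */
struct pcb *dispatch(struct ready_queue *q) {
    struct pcb *p = q->head;
    if (p) {
        q->head = p->next;
        if (q->head == NULL) q->tail = NULL;
    }
    return p;
}
```

Dispatching in strict head-of-queue order gives first-come, first-served scheduling; Chapter 5 examines more sophisticated policies.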
Wait Queues¶
The system also includes other queues, known collectively as wait queues (or device queues). A process is moved to a wait queue when it cannot continue execution until a specific event occurs.
- Common Reason: The most common reason is an I/O request. For example, if a process needs to read data from a disk, it must wait because devices like disks run much slower than the CPU. Rather than letting the CPU sit idle, the OS moves the process to the appropriate I/O wait queue until the data is available.
The Queueing Diagram¶
The flow of processes through these queues is best visualized with a queueing diagram. Go to Figure 3.5 for a representation of process scheduling.
In this diagram:
- The rectangles represent the queues (the ready queue and various wait queues).
- The circles represent the resources that serve the queues (the CPU and I/O devices).
- The arrows indicate the flow of processes from one state to another.
The Lifecycle of a Process in the Queues¶
Let's trace the path a process takes through the system, following Figure 3.5:
A new process is initially put into the ready queue.
It waits there until it is dispatched (selected) by the CPU scheduler and assigned to a CPU core.
Once the process is executing on the CPU, one of several events can occur:
- The process issues an I/O request: It is immediately moved from the CPU to the relevant I/O wait queue. (Path: CPU -> "I/O request" -> I/O wait queue)
- The process creates a child process and waits for it: It is moved to a wait queue until the child process terminates. (Path: CPU -> "create child process" -> child termination wait queue)
- An interrupt occurs or the time slice expires: The process is forcibly removed from the CPU and placed back into the ready queue to wait for its next turn. (Path: CPU -> "time slice expired" -> ready queue)
For processes in wait queues (the first two cases), the transition back to readiness happens when the event they were waiting for occurs:
- An I/O operation completes, moving the process from the I/O wait queue back to the ready queue. (Path: I/O wait queue -> "I/O" -> ready queue)
- A child process terminates, moving the waiting parent from the child termination wait queue back to the ready queue.
This cycle of moving between the ready queue, the CPU, and various wait queues continues for the entire life of the process.
When the process terminates, it is removed from all queues, and its PCB and all other resources are deallocated by the operating system.
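The transitions traced above can be summarized as a small state machine. The following is a conceptual sketch of the queueing diagram in Figure 3.5; every name in it is our own illustration, not a real kernel API:

```c
/* Where a process can be at any moment (the boxes and the CPU circle
 * of Figure 3.5), and the events that move it (the arrows). */
enum location { READY_QUEUE, RUNNING_ON_CPU, IO_WAIT_QUEUE,
                CHILD_WAIT_QUEUE, TERMINATED };

enum event { DISPATCH, IO_REQUEST, IO_COMPLETE, WAIT_FOR_CHILD,
             CHILD_EXITED, TIME_SLICE_EXPIRED, EXIT };

/* Apply one event to a process's current location and return the new one. */
enum location next_location(enum location loc, enum event ev) {
    switch (loc) {
    case READY_QUEUE:
        if (ev == DISPATCH)           return RUNNING_ON_CPU;
        break;
    case RUNNING_ON_CPU:
        if (ev == IO_REQUEST)         return IO_WAIT_QUEUE;
        if (ev == WAIT_FOR_CHILD)     return CHILD_WAIT_QUEUE;
        if (ev == TIME_SLICE_EXPIRED) return READY_QUEUE;
        if (ev == EXIT)               return TERMINATED;
        break;
    case IO_WAIT_QUEUE:
        if (ev == IO_COMPLETE)        return READY_QUEUE;
        break;
    case CHILD_WAIT_QUEUE:
        if (ev == CHILD_EXITED)       return READY_QUEUE;
        break;
    case TERMINATED:
        break;
    }
    return loc; /* an event that doesn't apply leaves the process in place */
}
```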
3.2.2 CPU Scheduling¶
We've seen how processes move between queues. The central component that manages the "ready queue -> CPU" transition is the CPU scheduler (often just called the scheduler).
The Role of the CPU Scheduler¶
The core responsibility of the CPU scheduler is to make the scheduling decision: selecting one process from the ready queue and allocating a CPU core to it.
This selection process needs to happen very frequently for two key reasons:
- I/O-bound Processes: These processes may only execute for a few milliseconds before they issue an I/O request and are moved to a wait queue. The scheduler must be ready to quickly pick a new process to run.
- Time Sharing and Fairness: Even a CPU-bound process is not allowed to keep the CPU for as long as it wants. To maintain system responsiveness and fairness, the operating system uses timer interrupts to forcibly remove the CPU from a running process after a specific time interval (called a time slice or quantum). When this time slice expires, the process is moved back to the ready queue, and the scheduler immediately runs to select the next process.
Because of this, the CPU scheduler is one of the most frequently executed parts of the OS, running at least once every 100 milliseconds, and often much more frequently.
Intermediate Scheduling and Swapping¶
Beyond the short-term CPU scheduler, some operating systems employ an intermediate form of scheduling. This is not about deciding which process runs next on the CPU, but about deciding which processes are allowed to be in memory and thus eligible for the ready queue.
The key idea is that sometimes it's beneficial to reduce the degree of multiprogramming (the number of processes in memory). This is done through a mechanism called swapping.
- Swapping Out: The OS can decide to remove a process completely from memory. It saves the entire process's image (its code, data, stack, and PCB state) to a special area on the disk (called the swap space). This process is now no longer in the ready queue and is not competing for the CPU.
- Swapping In: Later, the OS can load the swapped-out process from disk back into memory, restoring its status so it can continue execution from where it left off.
Why is Swapping Necessary? Swapping is typically used as a memory management technique. It becomes necessary when the system has overcommitted memory (i.e., there are more processes in memory than physical RAM can comfortably hold). By swapping out some idle processes, the OS frees up physical memory for active processes. We will discuss swapping in detail in Chapter 9.
3.2.3 Context Switch¶
When the CPU scheduler decides to stop running one process and start running another, a fundamental operation must occur: the context switch. This is the mechanism that makes multitasking possible.
What is a Context Switch?¶
A context switch is the process of saving the state of the currently running process (the "old" process) and loading the saved state of the new process to run.
- The "Context": The context of a process is all the information the CPU needs to resume that process from exactly where it left off. This is represented in the process's PCB and includes:
- The value of all CPU registers (including the program counter).
- The process state.
- Memory-management information (like page table pointers).
- The Operation: The core task is a state save of the old process's context into its PCB, followed by a state restore of the new process's context from its PCB.
Go to Figure 3.6 for a diagram illustrating a context switch between two processes, P0 and P1.
Context Switch Overhead¶
It is crucial to understand that context-switch time is pure overhead. While the CPU is busy saving and restoring contexts, it is not executing any user instructions or performing useful work for any process. The system's performance depends on making context switches as fast as possible.
- Typical Speed: A context switch typically takes several microseconds.
- Factors Influencing Speed:
- Hardware Support: This is the biggest factor. Some processors have features like multiple register sets. This allows a context switch to happen by simply changing a hardware pointer to the current register set, which is extremely fast. If there are more processes than register sets, the OS must fall back to copying data to and from memory.
- Memory Speed: The speed of the RAM.
- Number of Registers: The more data that must be copied, the longer it takes.
- OS Complexity: More complex operating systems, especially those using advanced memory management (like paging, covered in Chapter 9), have more work to do during a context switch (e.g., switching page tables, flushing translation lookaside buffers (TLBs)).
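Conceptually, the state save and state restore amount to copying register state between the CPU and the two PCBs. The toy sketch below illustrates this with memcpy; on real hardware the switch is done in low-level assembly and also swaps memory-management state, and all names here are our own:

```c
#include <string.h>

/* A toy "context": just a few registers. A real context also includes
 * memory-management information; the field names are illustrative. */
struct cpu_context {
    unsigned long pc;      /* program counter */
    unsigned long sp;      /* stack pointer */
    unsigned long regs[8]; /* general-purpose registers */
};

/* A toy PCB: the saved context lives inside it. */
struct toy_pcb {
    int pid;
    struct cpu_context ctx;
};

/* State save of prev's context into its PCB, then state restore of
 * next's context from its PCB onto the "CPU". While this runs, no
 * useful user work happens: this is the pure overhead of Section 3.2.3. */
void context_switch(struct cpu_context *cpu,
                    struct toy_pcb *prev, struct toy_pcb *next) {
    memcpy(&prev->ctx, cpu, sizeof(*cpu)); /* save old context */
    memcpy(cpu, &next->ctx, sizeof(*cpu)); /* restore new context */
}
```

The "multiple register sets" hardware optimization mentioned above corresponds to skipping both memcpy calls and merely repointing which register bank is active.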
Multitasking in Mobile Systems¶
Mobile operating systems like iOS and Android handle process scheduling with additional constraints, primarily to conserve battery life.
iOS (Apple):
- Early versions were very restrictive. Only one user application could run in the foreground (the app on the screen). All other user apps were suspended (not scheduled for CPU time).
- Starting with iOS 4, a limited form of multitasking was introduced, allowing a single foreground app to run concurrently with multiple background apps (in memory but not on the display). The API allowed certain app types (like music players) to perform limited tasks in the background.
- As hardware improved (more memory, multiple cores, better batteries), iOS supported richer multitasking, such as split-screen on iPads, which allows two foreground apps to run simultaneously.
Android:
- Android has always supported multitasking and does not restrict which applications can run in the background.
- However, if a background app needs to perform work, it must use a service. A service is a separate application component that runs on behalf of the background process without a user interface.
- Example: A music streaming app uses a service. When you switch to another app, the main application may be suspended, but the service continues running to send audio data to the device driver. This design is efficient because services have a small memory footprint.
3.3 Operations on Processes¶
In a modern operating system, processes are dynamic. They can be created and terminated on the fly. This section covers the mechanisms for these fundamental operations.
3.3.1 Process Creation¶
A process can create other new processes during its execution. This leads to a hierarchical relationship:
- Parent Process: The creator process.
- Child Process: The new process that was created.
Each child process can, in turn, create more processes, forming a tree (or hierarchy) of processes.
Operating systems like UNIX, Linux, and Windows identify each process by a unique integer known as a process identifier (pid). The kernel uses the pid as a key to access all of a process's attributes in its PCB.
Go to Figure 3.7 to see a typical process tree on a Linux system.
Let's analyze this figure:
- The root of the tree is the systemd process, which always has pid = 1. This is the first user process started when the system boots, and it is the ultimate parent of all other user processes.
- systemd creates child processes to manage system services. In the figure, it has created:
  - logind (pid 8415): manages users who log on directly to the system.
  - sshd (pid 3028): manages remote connections via Secure Shell (SSH).
- These service processes, in turn, create children for user sessions:
  - logind has created a bash shell (pid 8416) for a logged-in user.
  - The user, working in the bash shell, has launched two child processes: the ps command (pid 9298) and the vim editor (pid 9204).
  - Similarly, the sshd process has created a child sshd (pid 3610) to handle a specific remote connection, which then launched a tcsh shell (pid 4005) for the remote user.
The init and systemd Processes¶
There is an important historical note here. Traditional UNIX systems used a process called init (System V init) as the root parent process with pid 1. Linux initially adopted this.
Modern Linux distributions have replaced init with systemd. While both serve as the initial root process, systemd is more flexible and provides a wider range of services than the traditional init system.
You can view active processes on UNIX/Linux using the ps command. The command ps -el lists complete information for all processes. You can trace the parent-child relationships to build a process tree in your mind. Linux also provides the pstree command, which visually displays the entire process tree.
Resource Sharing in Process Creation¶
When a parent creates a child process, a critical question is: how are resources handled? The child process needs resources (CPU time, memory, files, I/O devices) to run. There are two general approaches:
- The child process may obtain its resources directly from the operating system.
- The child process may be constrained to use only a subset of the parent's resources.
The parent can manage resources for its children in different ways:
- Partitioning: The parent divides its resources among its children.
- Sharing: The parent and children share some resources (like memory segments or open files).
Restricting a child to a subset of the parent's resources is an important safety and stability measure. It prevents any single process from overloading the entire system by creating too many resource-hungry child processes.
Beyond just allocating resources, a parent process can also pass specific data or resources to its child to define its task.
- Passing Initialization Data: The parent can give the child input data. For example, a parent process could create a child whose job is to display a file. The parent would pass the filename (e.g., hw1.c) to the child, and the child would use that name to open the file and display it.
- Passing Resources: Some operating systems allow the parent to pass actual resources. In the same example, the parent could open the file hw1.c and the terminal device, and then pass these two open files to the child. The child's job then simplifies to copying data from the input file to the output terminal file.
Execution and Address Space Possibilities¶
When a process creates a child, the OS must decide two things:
Execution Flow:
- The parent and child execute concurrently.
- The parent waits for some or all of its children to finish before resuming.
Address Space:
- The child is a duplicate of the parent process (same program and data).
- The child has a brand new program loaded into it.
Process Creation in UNIX/Linux¶
UNIX and Linux provide a classic and clear model for process creation using two key system calls: fork() and exec().
Step 1: Creating a Copy with fork()
- A new process is created by the fork() system call.
- The key effect of fork() is that the child process is an exact duplicate of the parent's address space: both processes have the same code, data, heap, and stack at the moment of creation.
- Both the parent and the child start executing from the instruction immediately after the fork() call.
- The only way to tell them apart is by the return value of fork():
  - In the child process, fork() returns 0.
  - In the parent process, fork() returns the child's positive process identifier (pid).
  - If fork() fails, it returns -1.
Step 2: Loading a New Program with exec()
- After forking, one of the processes (often the child) typically uses an exec() system call.
- exec() replaces the current process's memory space with a brand new program. It loads the binary file of the new program, destroying the old memory image (the one that contained the exec() call).
- Because it replaces the entire address space, a successful call to exec() does not return; the process begins executing the new program's first instruction.
Step 3: Waiting for the Child with wait()
- The parent process, after creating a child, can use the wait() system call.
- wait() causes the parent to be suspended (blocked) until one of its children terminates.
- Once the child terminates, the parent resumes execution from the point after the wait() call.
Illustrative Example¶
Go to Figure 3.8 to see a C program that demonstrates this entire flow. Let's trace through it:
- The program starts as a single process.
- pid = fork(); is executed. Now there are two identical processes.
- Both processes check the value of pid:
  - In the child process (pid == 0):
    - It calls execlp("/bin/ls", "ls", NULL);.
    - This exec() call replaces the child's memory with the /bin/ls program (the directory lister).
    - The child executes ls and then terminates.
  - In the parent process (pid > 0; the value is the child's actual PID):
    - It calls wait(NULL), which blocks the parent, putting it to sleep.
    - When the child finishes, the parent wakes up.
    - The parent then prints "Child Complete" and exits.
This sequence of events is also illustrated in Figure 3.9. The child inherits attributes from the parent but then becomes a completely different program, while the parent waits for it to finish its task.
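Since the listing of Figure 3.8 is not reproduced here, the following is a self-contained sketch of the same fork()/execlp()/wait() flow, wrapped in a function so the parent can report the child's exit status. The wrapper name fork_exec_wait is ours, for illustration; the system calls are the standard POSIX ones:

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that runs /bin/ls, wait for it, and return the child's
 * exit status (-1 on failure). Mirrors the flow of Figure 3.8. */
int fork_exec_wait(void) {
    pid_t pid = fork();

    if (pid < 0) {                 /* fork failed */
        fprintf(stderr, "Fork Failed\n");
        return -1;
    } else if (pid == 0) {         /* child: fork() returned 0 */
        execlp("/bin/ls", "ls", (char *)NULL);
        _exit(1);                  /* reached only if exec() failed */
    } else {                       /* parent: fork() returned child's pid */
        int status;
        wait(&status);             /* block until the child terminates */
        printf("Child Complete\n");
        return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    }
}
```

Note the child calls _exit(1) after execlp(): because a successful exec() never returns, that line runs only if loading /bin/ls failed.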
The fork() Without exec() Scenario¶
It's important to note that a child process is not required to call exec() after a fork(). The child can choose to continue executing the same program as the parent. In this scenario:
- The parent and child become concurrent processes running the exact same code.
- Because fork() creates a full copy of the address space, each process has its own separate copy of all data. Changes made to variables in the child will not affect the parent, and vice versa.
This allows for a design where a single program can split into two parallel execution paths, each performing different tasks based on the return value of fork().
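A short experiment makes this independence concrete: after fork(), the child modifies its copy of a variable and reports the result through its exit status, while the parent's copy is untouched. This is our own illustrative sketch (the function name is hypothetical), assuming a POSIX system:

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 0 if parent and child really have separate copies of 'value'
 * after fork(), and -1 otherwise. */
int copies_are_separate(void) {
    int value = 10;
    pid_t pid = fork();

    if (pid < 0)
        return -1;                  /* fork failed */
    if (pid == 0) {                 /* child: modify its own copy */
        value += 5;
        _exit(value);               /* child's copy is now 15 */
    }
    /* parent */
    int status;
    waitpid(pid, &status, 0);
    /* The parent's copy is still 10; the child reported 15. */
    if (value == 10 && WIFEXITED(status) && WEXITSTATUS(status) == 15)
        return 0;
    return -1;
}
```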
Process Creation in Windows¶
The Windows operating system uses a different model for process creation, centered on the CreateProcess() function.
While CreateProcess() serves the same purpose as fork() (a parent creates a child), its behavior and usage are quite different:
| Feature | UNIX/Linux (fork() and exec()) | Windows (CreateProcess()) |
|---|---|---|
| Core Action | Creates a duplicate copy of the parent. | Creates a new process and loads a specified program into it. |
| Address Space | Child inherits a copy of the parent's address space. | Child starts with the address space of a new program. |
| System Calls | Two steps: fork() (create copy) then exec() (load program). | One step: CreateProcess() does both at once. |
| Parameters | fork() takes no parameters. | CreateProcess() requires at least ten parameters. |
Key Differences Explained:
- Inheritance vs. Specification: The UNIX model is based on inheritance (the child gets everything the parent has). The Windows model is based on specification (the parent must specify exactly what program the child will run from the start).
- Simplicity vs. Control: The UNIX fork() is a simple, parameter-less call that is very powerful. The Windows CreateProcess() is a single, more complex function that gives the programmer fine-grained control over the new process's attributes from the moment of its creation.
Go to Figure 3.10 to see an example of a C program using CreateProcess(). This program creates a child process that immediately loads the mspaint.exe application (Microsoft Paint). The code uses default values for most of the ten parameters required by CreateProcess().
For those who need to work deeply with Windows processes, further study of the Windows API and the specific parameters of CreateProcess() is recommended.
Process Creation in Windows (Detailed Explanation)¶
The previous section introduced the CreateProcess() function. Now, let's break down its parameters and the accompanying program in detail.
Go to Figure 3.10 to see the C program that creates a process using the Windows API.
Key Data Structures for CreateProcess()¶
The CreateProcess() function uses two main structures to manage information:
- STARTUPINFO Structure: This structure allows the parent to specify many properties of the new child process's user interface and standard handles. This includes:
  - The window's size and position on the screen.
  - The window's appearance (e.g., should it be shown or hidden?).
  - Handles for standard input, output, and error files.

  In the provided code, ZeroMemory(&si, sizeof(si)); zeroes out the structure's memory (it does not allocate it), and si.cb = sizeof(si); sets the size of the structure, which is a required step.
- PROCESS_INFORMATION Structure: After CreateProcess() successfully creates the new process, it returns information about that process back to the parent by filling in this structure. It contains:
  - A handle to the new process (hProcess).
  - A handle to the new process's primary thread (hThread).
  - The process identifier (pid) of the new process (dwProcessId).
  - The thread identifier of the new process's primary thread (dwThreadId).
Analyzing the CreateProcess() Call¶
Let's examine the call to CreateProcess() in the code:
- NULL: The first parameter (application name) is NULL, meaning we are not explicitly specifying the application name.
- "C:\\WINDOWS\\system32\\mspaint.exe": The second parameter (command line) specifies the full path to the executable file to be loaded. Since the first parameter is NULL, the system uses this command line to find and load the mspaint.exe application (Microsoft Paint).
- NULL, NULL, FALSE: The next three parameters are the process and thread security attributes (left at their defaults) and the handle-inheritance flag; FALSE specifies that the child process should NOT inherit handles from the parent.
- 0, NULL, NULL: These parameters specify no creation flags, use of the parent's environment block, and use of the parent's current directory.
- &si, &pi: The final two parameters are pointers to the STARTUPINFO and PROCESS_INFORMATION structures prepared earlier.
The Parent's Wait and Cleanup¶
- WaitForSingleObject(pi.hProcess, INFINITE);: This is the Windows equivalent of the UNIX wait() system call. The parent passes the handle to the child process (pi.hProcess); the INFINITE parameter means the parent will wait indefinitely until the child process terminates. This blocks the parent, just as wait() does in UNIX.
- CloseHandle(pi.hProcess); and CloseHandle(pi.hThread);: In Windows, when you are done using a resource like a process or thread handle, you must explicitly close it to avoid resource leaks. This is a crucial cleanup step.
This entire flow is the Windows way of achieving what the UNIX program in Figure 3.8 and the associated diagram in Figure 3.9 accomplish: creating a child process, waiting for it to finish, and then continuing.
3.3.2 Process Termination¶
This section explains all the ways a process ends and how the operating system cleans up afterward.
How a Process Ends Voluntarily¶
A process normally terminates when it has done its job. It executes its final line of code and then explicitly asks the operating system to delete it by invoking the exit() system call.
- What happens during exit()? The operating system performs a cleanup routine, deallocating all resources that were assigned to the process, including:
  - physical and virtual memory
  - open files
  - I/O buffers
- Communication with the parent: The process can return a status value (such as an integer) to its parent process. The parent retrieves this status using the wait() system call, which we'll discuss in more detail soon.
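A minimal sketch of this handshake on a POSIX system (the function name collect_child_status is ours, for illustration): the child passes a status value to exit(), and the parent recovers it with wait() and the WEXITSTATUS() macro:

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* A child terminates with exit(7); the parent retrieves both the child's
 * pid and its status value through wait(). Returns the status the parent
 * observed, or -1 on error. */
int collect_child_status(void) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        exit(7);                    /* child returns status 7 to its parent */

    int status;
    pid_t done = wait(&status);     /* blocks until the child terminates */
    if (done != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);     /* the value the child passed to exit() */
}
```

Calling wait() here is also exactly what reaps the child's process-table entry, preventing it from lingering as a zombie (discussed next).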
Involuntary Termination: When a Parent Kills a Child¶
A process doesn't always get to finish on its own terms. Termination can also be forced, usually by its parent process, using a system call like TerminateProcess() in Windows.
- Why is this restricted? This power is typically restricted to the parent process. If any process could terminate any other, a user or a buggy application could arbitrarily kill another user's processes, causing chaos.
- How does the parent know which process to kill? When a parent creates a child process, the operating system gives the parent the unique process identifier (PID) of the new child.
The textbook lists several reasons a parent might terminate a child:
- Resource Overuse: The child is using more resources (CPU time, memory) than it was allocated.
- Task No Longer Needed: The job the child was created for is no longer required.
- Parent is Exiting: On many operating systems, if a parent process terminates, all of its children must be terminated as well.
This last scenario leads to an important concept called cascading termination. If a process terminates, the OS will automatically terminate all of its children, and those children's children, and so on, in a cascading manner.
The Technical Details in UNIX/Linux¶
Let's look at the specific system calls used in UNIX and Linux systems.
- The exit() system call: A process calls exit() to end itself. The parameter it passes (e.g., exit(1)) is the exit status that its parent can read. In C programs, even if you don't explicitly call exit(), the C run-time library calls it for you when main() returns.
- The wait() system call and zombie processes: This is a crucial concept for process management.
  - A parent uses wait(&status) to pause its own execution until one of its child processes terminates.
  - The wait() call gives the parent the child's exit status and its process identifier (PID).
  - What is a zombie process? When a process terminates, its resources are freed, but its entry in the process table must remain until the parent calls wait() to read the child's exit status. A process in this state (terminated but still having an entry in the process table) is called a zombie process.
  - All processes become zombies briefly upon termination.
  - The zombie is finally "reaped" (its process-table entry is freed) when the parent calls wait().
- Orphan processes and their adoption: What happens if a parent terminates without calling wait() on its children? The children become orphan processes.
  - Traditional UNIX systems solve this by having the init process (the very first process, PID 1, which you learned about in Section 3.3.1) automatically become the new parent of all orphaned processes. The init process periodically calls wait(), which cleans up any zombies left behind by their original parent.
  - In modern Linux systems, init has largely been replaced by systemd. However, systemd (or another designated process) still performs this same crucial role of adopting orphan processes and reaping them to prevent permanent zombies.
3.3.2.1 Android Process Hierarchy¶
This section explains how Android, an operating system for mobile devices with limited resources like memory, decides which processes to terminate when it needs to free up system resources. Unlike a desktop OS that might have more flexibility, Android cannot afford to terminate processes arbitrarily. Instead, it uses a strict importance hierarchy.
The Need for a Hierarchy¶
Because of constraints like limited memory, Android must sometimes terminate existing processes to make resources available for new or more important processes. The goal is to do this in a way that is least disruptive to the user. It does this by ranking processes by importance and terminating them from least important to most important.
The Android Process Importance Hierarchy¶
Here are the process classifications, listed from most important to least important:
Foreground Process: This is the most important type of process. It is the current process visible on the screen, representing the application the user is actively interacting with right now (e.g., the app you have open and are typing in). Terminating this would directly interrupt the user.
Visible Process: This is a process that is not in the direct foreground but is still performing an activity that the user is aware of. A classic example is a navigation app that is running and displaying your route on a portion of the screen while another app is in the foreground. The user can see its output, so it has high importance.
Service Process: This is a process that is running a background service, but the service is performing an activity that is "apparent" to the user. The textbook gives the example of streaming music. Even though you're not in the music app, you know the service is running because you can hear the music. Terminating it would be noticeable.
Background Process: This is a process that is performing an activity, but it is not apparent to the user at the moment. An example is an app that you used earlier but have since navigated away from, and it's not doing anything critical. The user wouldn't immediately notice if this process was terminated.
Empty Process: This is the least important type of process. It is a cache of a previously used application that holds no active components. It's kept in memory only to speed up a future relaunch of the app. Terminating this has no negative effect on the user experience.
How Termination Works¶
When system resources must be reclaimed, Android will terminate processes in this exact order of increasing importance:
- It will first terminate Empty processes.
- If more resources are needed, it will then terminate Background processes.
- It will continue this pattern, moving up the hierarchy, until enough resources are freed.
Process Ranking and Lifecycle¶
The operating system is responsible for assigning each process the highest ranking it qualifies for.
- Example: If a process is both providing a service (like music playback) and is partially visible on the screen, it will be assigned the more-important Visible classification, not the Service classification.
Furthermore, Android development guidelines require apps to follow specific rules for their process life cycle. When these guidelines are followed, the state of a process (like where you were in a game or what you had typed in a note) is saved before the process is terminated. If the user navigates back to the application later, the process can be restarted and resumed from its saved state, making the termination and relaunch much smoother and less noticeable.
3.4 Interprocess Communication¶
This section introduces the concept of how processes can work together, which is a fundamental part of modern operating systems.
Independent vs. Cooperating Processes¶
Processes running at the same time in an OS can be categorized into two types:
- Independent Process: A process that does not share data with any other process. It operates in isolation and is not affected by, nor can it affect, the execution of other processes.
- Cooperating Process: A process that can interact with other processes. It can affect or be affected by them. Any process that shares data with another is automatically a cooperating process.
Why Allow Process Cooperation?¶
There are three primary reasons for enabling processes to cooperate:
Information Sharing: Several applications might need access to the same data. A common example is your computer's clipboard; when you copy and paste, multiple applications are accessing the same piece of information. The OS must provide an environment for this concurrent access.
Computation Speedup: To make a large task run faster, we can break it down into smaller subtasks that run in parallel. Important Note: This provides a real speedup only if the computer has multiple processing cores (CPUs), allowing true parallel execution. Otherwise, the tasks are just switching on a single core.
Modularity: We can design a system in a modular way, dividing system functions into separate, cooperating processes (or threads). This makes the system easier to build, maintain, and extend.
The Two Models of Interprocess Communication (IPC)¶
For cooperating processes to exchange data, we need a mechanism called Interprocess Communication (IPC). There are two fundamental models for this:
Shared Memory: In this model, a region of memory is set up that can be accessed by all the cooperating processes. Processes exchange information by simply reading from and writing to this shared region. It's like a shared whiteboard that multiple people can read and write on.
Message Passing: In this model, processes communicate by sending and receiving discrete messages to/from each other. This communication typically goes through the operating system's kernel. It's like passing notes between people; the note is handed from one to the other.
Go to Figure 3.11 in your textbook. This figure contrasts the two models visually:
- In diagram (a), Shared Memory, Process A and Process B both have access to a common block of memory. The data exchange happens directly in this space.
- In diagram (b), Message Passing, Process A and Process B send messages (m0, m1, etc.) to each other via a message queue managed by the kernel.
Comparing Shared Memory and Message Passing¶
Both models are common, and many operating systems support both.
Advantages of Message Passing:
- It is useful for exchanging smaller amounts of data.
- It is easier to implement because the kernel handles the communication, avoiding complex conflict scenarios.
- It can be extended to distributed systems (where processes are on different computers connected by a network) much more easily than shared memory.
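The message-passing model can be sketched with a POSIX pipe, the simplest kernel-mediated channel. This demo function is our own illustration, not from the text; note that both the send (write) and the receive (read) are system calls that go through the kernel:

```c
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Child sends the message "m0" through a pipe; the parent receives it.
 * Returns 0 if the message arrived intact, -1 otherwise. */
int message_passing_demo(void) {
    int fd[2];
    if (pipe(fd) < 0)               /* kernel creates the channel */
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {                 /* child: the sender */
        close(fd[0]);               /* close unused read end */
        const char msg[] = "m0";
        write(fd[1], msg, sizeof(msg));  /* send = a system call */
        _exit(0);
    }
    /* parent: the receiver */
    close(fd[1]);                   /* close unused write end */
    char buf[8] = {0};
    read(fd[0], buf, sizeof(buf));  /* receive = a system call */
    close(fd[0]);
    waitpid(pid, NULL, 0);
    return strcmp(buf, "m0") == 0 ? 0 : -1;
}
```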
Advantages of Shared Memory:
- It can be faster once it's set up. Message passing requires a system call (and therefore kernel intervention) for every single message sent or received, which is computationally expensive. In shared memory, system calls are needed only to establish the shared region. After that, processes read and write to the shared memory just like normal memory accesses, without any need for the kernel's help, making it very fast.
The following sections, 3.5 and 3.6, will explore shared-memory and message-passing systems in more detail.
MULTIPROCESS ARCHITECTURE—CHROME BROWSER¶
This section explains how modern web browsers, specifically Google Chrome, use a multiprocess architecture to solve problems of stability, security, and performance.
The Problem with Single-Process Browsers¶
Websites use active content like JavaScript, Flash, and HTML5 to create dynamic experiences. However, this code can contain bugs that cause a web browser to become slow or even crash entirely.
This becomes a major issue with tabbed browsing, where a single browser instance displays multiple websites in different tabs. In a traditional, single-process browser, all tabs run within the same process.
- The Critical Flaw: If a website in one tab crashes, it takes down the entire browser process, causing all your tabs to crash. This is a poor user experience.
Chrome's Solution: A Multiprocess Architecture¶
Google's Chrome browser was designed to fix this problem by using multiple cooperating processes. It separates different functions into distinct process types:
The Browser Process:
- This is the main manager of the entire application.
- It is responsible for the user interface (the address bar, buttons, tabs themselves) and for disk and network input/output (I/O).
- Only one browser process exists for the entire Chrome application.
The Renderer Processes:
- These processes contain the logic for rendering (drawing) a web page. They interpret and execute the HTML, JavaScript, and CSS for a site.
- A general rule is one renderer process per website tab. This means if you have ten tabs open, you likely have ten active renderer processes.
- This is the core of the solution.
The Plug-in Processes:
- A separate process is created for each type of plug-in (like Adobe Flash or Apple QuickTime).
- This isolates the often-buggy and vulnerable plug-in code from the rest of the browser.
Advantages of This Multiprocess Design¶
This architecture provides two massive benefits:
Stability (Crash Isolation):
- Because each website runs in its own isolated renderer process, if one website crashes, only that specific renderer process dies.
- The browser process and all other renderer processes for your other tabs remain completely unaffected. You might see an "Aw, Snap!" error on one tab, but all your other tabs continue to work perfectly.
Security (The Sandbox):
- Each renderer process runs in a restricted environment called a sandbox.
- The sandbox limits the renderer's access to sensitive system resources like your disk and network. Even if a malicious website exploits a bug in the renderer, the sandbox prevents the attack from doing significant harm to your system, as the compromised process has very limited permissions.
3.5 IPC in Shared-Memory Systems¶
This section dives into the details of how processes communicate by sharing a common block of memory. This is a powerful but complex mechanism that requires careful setup and coordination.
Establishing Shared Memory¶
For processes to use shared memory, they must first establish a region of memory that they all can access. Here is the typical process:
- One process creates a shared-memory segment. This segment resides within the address space of the creating process.
- Any other process that wants to communicate must then attach this same shared-memory segment to its own address space.
- Normally, the operating system strictly enforces memory isolation, preventing one process from accessing another's memory. Using shared memory requires that the processes explicitly agree to remove this restriction.
Once the shared region is set up, processes can exchange information by directly reading and writing data to these shared areas. The operating system is no longer involved in the actual data transfer. The processes themselves are responsible for:
- Deciding the format of the data.
- Deciding the location of data within the shared segment.
- Ensuring they do not write to the same location at the same time, which would cause data corruption.
The Producer-Consumer Problem¶
A classic example that demonstrates cooperating processes is the producer-consumer problem. This is a common paradigm in computing.
- A producer process generates information or data.
- A consumer process uses or processes that data.
Real-world examples:
- A compiler (producer) generates assembly code that is consumed by an assembler (consumer).
- A web server (producer) provides HTML files and images that are consumed by a client web browser (consumer).
Solving with a Shared Buffer¶
To allow the producer and consumer to run concurrently, we need a buffer—a temporary storage area—that resides in shared memory. The producer adds items to the buffer, and the consumer removes them.
Crucially, the processes must be synchronized. The consumer must not try to take an item from an empty buffer, and (in a bounded system) the producer must not try to add an item to a full buffer.
There are two types of buffers:
- Unbounded Buffer: Has no practical size limit. The consumer might have to wait for new items, but the producer can always produce.
- Bounded Buffer: Assumes a fixed buffer size. This is the more common and practical scenario. Here, the consumer must wait if the buffer is empty, and the producer must wait if the buffer is full.
The Bounded-Buffer Solution in Detail¶
Let's examine a shared-memory implementation of a bounded buffer. The following variables reside in the shared memory region:
#define BUFFER_SIZE 10
typedef struct {
. . . // This struct defines what an "item" is
} item;
item buffer[BUFFER_SIZE];
int in = 0;
int out = 0;
The buffer is a circular array. It is managed with two logical pointers (indices):
- in: Points to the next free position in the buffer where a producer can put a new item.
- out: Points to the first full position in the buffer from which a consumer can take an item.
How to determine the buffer's state:
- The buffer is empty when in == out.
- The buffer is full when ((in + 1) % BUFFER_SIZE) == out. (This scheme leaves one empty slot to distinguish the full condition from the empty condition.)
The Producer Process
The producer has a local variable next_produced where it creates a new item.
item next_produced;
while (true) {
/* produce an item in next_produced */
// This while-loop is the synchronization check.
// It spins and does nothing if the buffer is full.
while (((in + 1) % BUFFER_SIZE) == out)
; /* do nothing */
// Once there's space, put the item in the buffer
buffer[in] = next_produced;
// and then update the 'in' pointer.
in = (in + 1) % BUFFER_SIZE;
}
The Consumer Process
The consumer has a local variable next_consumed where it will store the item it takes.
item next_consumed;
while (true) {
// This while-loop is the synchronization check.
// It spins and does nothing if the buffer is empty.
while (in == out)
; /* do nothing */
// Once there's an item, take it from the buffer
next_consumed = buffer[out];
// and then update the 'out' pointer.
out = (out + 1) % BUFFER_SIZE;
/* consume the item in next_consumed */
}
Important Notes on this Code:
- Buffer Capacity: This specific implementation allows a maximum of BUFFER_SIZE - 1 items in the buffer at one time. It is possible to write a solution that uses all BUFFER_SIZE slots.
- The Critical Unaddressed Issue: The code as shown has a major flaw. The statements buffer[in] = ... and in = ... in the producer (and similarly in the consumer) are not atomic, so the operating system could interrupt a process after it checks the while-loop condition but before it finishes updating the buffer or the pointer. If the producer and consumer access the shared variables in, out, and buffer at the same time without restriction, a race condition will occur, leading to incorrect data or a corrupted buffer state. This problem of concurrent access is what process synchronization (covered in Chapters 6 and 7) is designed to solve.
3.6 IPC in Message-Passing Systems¶
This section introduces the second major model for Interprocess Communication (IPC), which is fundamentally different from the shared-memory approach.
Contrast with Shared Memory¶
In the previous section, you learned about shared memory, where processes communicate by reading and writing to a common memory region. This requires the programmer to explicitly set up that shared segment and write the code to manage access to it.
Message passing offers an alternative. Here, the operating system itself provides a facility for processes to communicate by sending and receiving discrete messages. The key advantage is that processes do not need to share the same address space. This makes message passing especially useful for:
- Distributed Systems: Where processes are on different computers connected by a network. For example, an Internet chat program uses message passing; each participant's client application exchanges messages over the network.
- Situations where enforcing memory isolation between processes is desirable for security or stability.
Basic Message-Passing Operations¶
A message-passing facility must provide at least two fundamental operations:
- send(message)
- receive(message)
Message Size¶
Messages can be handled in two ways, representing a classic trade-off in OS design:
- Fixed-sized messages: The system-level implementation is straightforward and efficient for the OS. However, this makes programming more difficult because the application programmer must work within the fixed size limit.
- Variable-sized messages: The system-level implementation is more complex for the OS. However, this makes the programming task much simpler and more flexible.
Logical Implementation of Communication Links¶
For two processes, P and Q, to communicate, a communication link must exist between them. We are not concerned with the physical hardware (like a network cable or bus) but with the logical implementation—how the link is managed by the operating system from a software perspective.
There are three key design choices for logically implementing this link and its operations:
Direct or Indirect Communication:
- This concerns how processes name each other to establish a connection. Do they send a message directly to a process's name, or do they use an intermediate mailbox or port?
Synchronous or Asynchronous Communication:
- This concerns the timing and blocking behavior of the send() and receive() operations. Does the sender wait for the receiver to get the message? Does the receiver wait if no message is available?
Automatic or Explicit Buffering:
- This concerns what happens to messages that are sent before the receiver is ready for them. Where are they stored temporarily, and how much storage is available?
The following sections in your textbook will explore each of these three design issues in detail.
3.6.1 Naming¶
This section explains how processes identify each other to establish communication in a message-passing system. There are two primary strategies: direct and indirect communication.
Direct Communication¶
In direct communication, processes must explicitly name the other process they want to communicate with.
The basic primitives are defined as:
- send(P, message) — Send a message to process P.
- receive(Q, message) — Receive a message from process Q.
Properties of a communication link in direct communication:
- A link is automatically established between any two processes that want to communicate. They only need to know each other's identity (PID).
- A link is exclusively between two processes.
- There is exactly one link between each pair of processes.
This is known as symmetric addressing because both the sender and receiver must name each other.
A variant uses asymmetric addressing:
- send(P, message) — Send a message to process P. (The sender names the receiver.)
- receive(id, message) — Receive a message from any process. The variable id is set to the name of the process that sent the message. (The receiver does not name a specific sender.)
Disadvantage of Direct Communication:
The main drawback is limited modularity. The process identifiers (like P and Q) are hard-coded into the programs. If you need to change a process's identifier, you must find and update all references to that identifier in every other process that communicates with it. This makes the system inflexible and harder to maintain.
Indirect Communication¶
To solve the modularity problem, we use indirect communication via mailboxes (also called ports). A mailbox is an abstract object that acts as a temporary holding place for messages.
- Processes send messages to a mailbox and receive messages from a mailbox.
- Each mailbox has a unique identification (e.g., POSIX message queues use an integer value).
- Two processes can communicate only if they share a common mailbox.
The primitives are defined as:
- send(A, message) — Send a message to mailbox A.
- receive(A, message) — Receive a message from mailbox A.
Properties of a communication link in indirect communication:
- A link is established only if both processes share a common mailbox.
- A link may be associated with more than two processes. (Multiple processes can share the same mailbox).
- Between two processes, multiple links may exist, each corresponding to a different mailbox.
The Multi-Receiver Problem and Solutions¶
Indirect communication introduces a new problem. Suppose processes P1, P2, and P3 all share mailbox A. P1 sends a message to A. If both P2 and P3 execute a receive() from A, which one gets the message?
The system designer must choose a policy to resolve this. The answer depends on the chosen method:
- Restrict a link to at most two processes.
- Allow only one process at a time to execute a receive() operation on a given mailbox.
- Allow the system to select a receiver arbitrarily (e.g., P2 or P3, but not both). The selection algorithm could be round-robin, and the system may also inform the sender which process received the message.
Mailbox Ownership¶
A critical design decision is who owns the mailbox.
1. Owned by a Process:
- The mailbox is part of the process's address space.
- We distinguish between the owner (the only process that can receive from the mailbox) and the user (any process that can send to the mailbox).
- This eliminates confusion about who receives a message.
- When the owning process terminates, the mailbox is destroyed. Any process that later tries to send to it must be notified.
2. Owned by the Operating System:
- The mailbox has an independent existence and is not tied to a specific process's lifespan.
- The OS must provide system calls for a process to:
- Create a new mailbox.
- Send and receive messages through the mailbox.
- Delete a mailbox.
- The process that creates the mailbox is its default owner and initially the only receiver. However, the ownership and receiving privilege can be transferred to other processes via system calls, which can, in turn, lead to the multi-receiver scenario described above.
3.6.2 Synchronization¶
This section explains the timing and waiting behavior—the synchronization—of the send() and receive() operations in a message-passing system. This behavior is crucial for coordinating processes.
Blocking vs. Nonblocking Operations¶
Message passing can be implemented with either blocking (synchronous) or nonblocking (asynchronous) primitives. This is a key design choice that affects how processes interact.
There are four possible variations:
Blocking Send:
- The sending process is blocked (put to sleep) until the message is received by the receiving process or successfully placed into the mailbox.
- This provides a strong guarantee that the message has been delivered before the sender continues.
Nonblocking Send:
- The sending process sends the message and immediately resumes operation. It does not wait for any form of acknowledgment.
- This is faster but offers no guarantee that the message was received.
Blocking Receive:
- The receiving process is blocked until a message is available for it to receive.
- This is the most common pattern for a consumer that needs to wait for work.
Nonblocking Receive:
- The receiving process attempts to retrieve a message. It immediately gets a result, which is either a valid message or a null/no-message-available indicator.
- This is useful for a process that needs to check for messages while also doing other work.
The Rendezvous¶
When both the send() and receive() operations are blocking, it creates a situation known as a rendezvous. This means both the sender and the receiver synchronize at the moment of message transfer: the sender blocks until the receiver gets the message, and the receiver blocks until the sender provides one. They "meet" at the point of communication.
Solving the Producer-Consumer Problem with Blocking Send/Receive¶
Using blocking send() and receive() makes the producer-consumer problem much simpler compared to the shared-memory solution. The need for explicit checks for a full or empty buffer is handled automatically by the blocking nature of the calls.
Go to Figure 3.14: The Producer Process using Message Passing
message next_produced;
while (true) {
/* produce an item in next_produced */
send(next_produced);
}
- The producer creates an item and then immediately calls send(). If the consumer is slow or the system buffer is full, the producer will automatically block on the send() call until the consumer is ready. There is no need for a manual while loop to check for buffer space.
Go to Figure 3.15: The Consumer Process using Message Passing
message next_consumed;
while (true) {
receive(next_consumed);
/* consume the item in next_consumed */
}
- The consumer calls receive(). If no message is available (i.e., the buffer is empty), the consumer will automatically block on the receive() call until the producer sends a message. There is no need for a manual while loop to check for an empty buffer.
In this scheme, the synchronization is managed by the message-passing system itself, simplifying the application code significantly. The producer and consumer are perfectly synchronized: the producer cannot get ahead of the consumer, and the consumer waits for the producer.
3.6.3 Buffering¶
This section explains what happens to messages in transit—where they are stored while waiting to be received. This temporary storage is managed by a queue, and its implementation is a critical aspect of message-passing systems.
The Temporary Message Queue¶
Whether communication is direct or indirect, messages sent by a process don't vanish if the receiver isn't ready. They reside in a temporary queue. This queue can be implemented in one of three ways, which directly impacts whether a send() operation will block.
Three Queue Capacity Types¶
Zero Capacity (No Buffering):
- The queue has a maximum length of zero. This means the link cannot hold any waiting messages.
- Consequence for the Sender: The sender must block on every send() operation until the recipient is ready and executes a matching receive(). This creates the rendezvous synchronization point discussed in the previous section.
- This is often called a message system with no buffering.
Bounded Capacity:
- The queue has a finite length of n messages. This means it can hold up to n messages waiting to be received.
- Consequence for the Sender:
- If the queue is not full, the message is placed into it (either by copying the entire message or storing a pointer to it), and the sender can continue execution immediately (nonblocking send).
- If the queue is full, the sender must block until space becomes available in the queue (i.e., until the consumer receives at least one message).
- This is the most common and practical implementation, providing a balance between performance and resource management.
Unbounded Capacity:
- The queue has a theoretically infinite length. Any number of messages can wait in it.
- Consequence for the Sender: The sender never blocks. The send() operation always completes immediately, as there is always space in the queue.
- This is the most flexible arrangement for the sender, but it requires the system to devote potentially very large memory resources to a backlog of unreceived messages.
Systems with either bounded or unbounded capacity are referred to as having automatic buffering, as the system automatically manages the storage of messages without requiring the sender to always wait for the receiver.
3.7 Examples of IPC Systems¶
This section transitions from the theory of Interprocess Communication (IPC) to real-world implementations. We will look at how different operating systems actually provide these services.
The section covers four specific examples:
The POSIX API for Shared Memory: We will see the standard, portable set of system calls used in UNIX-like systems (Linux, macOS, BSD) to set up and use shared memory regions.
Message Passing in the Mach Operating System: Mach is a highly influential microkernel where message passing is the fundamental IPC mechanism, and we'll explore how it works.
Windows IPC: We will examine the IPC mechanisms in Microsoft Windows. Interestingly, Windows often uses shared memory as a mechanism to implement high-performance message passing.
Pipes: We will discuss one of the earliest and simplest IPC mechanisms on UNIX systems, which allows for a straightforward stream of data between processes.
Each example will illustrate how the general concepts of shared memory, message passing, synchronization, and buffering are applied in practice.
3.7.1 POSIX Shared Memory¶
This section details the specific steps and system calls used to create and manage shared memory in POSIX-compliant systems like Linux and macOS. The POSIX standard provides a clear API for this purpose.
The Core Concept: Memory-Mapped Files¶
POSIX shared memory is implemented using a technique called memory-mapped files. This means the region of shared memory is associated with a special file in the system. Instead of reading and writing to this file with standard I/O operations, processes map it directly into their own address space, allowing them to access it as if it were regular memory.
Step-by-Step Process for Setting Up POSIX Shared Memory¶
Here is the sequence of system calls a process must use to create a shared memory segment:
Create or Open the Shared Memory Object: shm_open()
- A process starts by creating a shared-memory object using the shm_open() system call:
fd = shm_open(name, O_CREAT | O_RDWR, 0666);
- name: A unique name (like "/my_shared_mem") that other processes will use to access this same shared memory. This is how processes agree on which segment to use.
- O_CREAT | O_RDWR: Flags that tell the OS to create the object if it doesn't exist (O_CREAT) and to open it for both reading and writing (O_RDWR).
- 0666: File permissions (read and write for owner, group, and others).
- Return Value: On success, shm_open() returns an integer file descriptor (fd), which is a handle to the shared-memory object.
Set the Size of the Object: ftruncate()
- Once the object exists, you must define its size using ftruncate():
ftruncate(fd, 4096);
- This call sets the size of the shared-memory object to 4096 bytes (or whatever size you specify), allocating the actual space.
Map the Object into the Process's Address Space: mmap()
- This is the final and most important step. The mmap() system call maps the shared-memory object (the file) into the calling process's address space:
ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
- Return Value: mmap() returns a pointer (ptr) to the start of the shared-memory region within this process's address space.
- From this point on, the process can read and write to the shared memory through this pointer, just like any other memory location. Any changes made are visible to all other processes that have mapped the same object.
How Other Processes Connect:
Another process that wants to access the same shared memory uses the same procedure. It calls shm_open() with the same name (but without the O_CREAT flag if it already exists), then ftruncate() (if it's the one setting the size), and finally mmap() to get its own pointer to the same physical memory region.
This section provides a concrete code example of the producer-consumer model implemented using the POSIX shared-memory API we just discussed. The example consists of two separate programs: a Producer and a Consumer.
The Producer Program¶
Go to Figure 3.16: Producer process illustrating POSIX shared-memory API.
The producer's job is to create the shared memory segment and write data into it.
Setup:
- The program defines the size of the shared memory (SIZE = 4096) and a unique name for the shared object (name = "OS").
Creating the Shared Memory Object:
fd = shm_open(name, O_CREAT | O_RDWR, 0666);
- This creates a shared-memory object named "OS" with read/write permissions. The O_CREAT flag ensures it is created if it doesn't already exist.
Setting the Size:
ftruncate(fd, SIZE);
- This configures the size of the shared-memory object to 4096 bytes.
Memory Mapping:
ptr = mmap(0, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
- This is the critical step. It maps the shared-memory object into the producer's address space.
- The MAP_SHARED flag is essential: any writes to this memory region will be visible to all other processes that have mapped the same object.
Writing Data:
- The producer writes the strings "Hello" and "World!" to shared memory using sprintf(ptr, ...).
- After writing each string, it increments the pointer ptr by the number of bytes written (strlen(message_0)). This moves the pointer forward in the shared buffer so the next write doesn't overwrite the previous one.
When the producer finishes, the shared memory contains the string "HelloWorld!". The producer then terminates, but the shared-memory object persists in the system.
The Consumer Program¶
Go to Figure 3.17: Consumer process illustrating POSIX shared-memory API.
The consumer's job is to access the existing shared memory segment and read the data from it.
Opening the Shared Memory Object:
fd = shm_open(name, O_RDONLY, 0666);
- The consumer opens the existing shared-memory object "OS". It opens it as O_RDONLY (read-only), since it only needs to read the data.
Memory Mapping:
ptr = mmap(0, SIZE, PROT_READ, MAP_SHARED, fd, 0);
- The consumer maps the same object into its own address space. Notice the protection is PROT_READ, because it only needs to read.
Reading Data:
printf("%s", (char *)ptr);
- The consumer simply prints the contents of the shared memory, starting from the pointer ptr. This will output "HelloWorld!".
Cleanup: shm_unlink()
shm_unlink(name);
- This is a very important system call. It removes the shared-memory object named "OS" from the system. The object is destroyed once all processes that had it mapped close it; this is how the system reclaims the shared-memory resource.
Key Points and Omissions¶
- This example is simplified. In a real producer-consumer system, you would need a synchronization mechanism (like a semaphore, covered in Chapter 6) to ensure the consumer doesn't try to read before the producer has finished writing. The code as shown has a race condition.
- The use of shm_unlink() by the consumer is just one design choice. The producer, or a separate process, could also be responsible for cleanup.
- The text mentions that memory mapping is covered in more detail in Section 13.5.
3.7.2 Mach Message Passing¶
This section explores how message passing is implemented in the Mach operating system, a highly influential microkernel that forms the core of macOS and iOS. Mach treats message passing as its fundamental mechanism for all communication.
Core Concepts in Mach¶
- Tasks and Threads: In Mach, a task is similar to a process but is a container that holds resources, including multiple threads of control.
- Messages and Ports: All communication in Mach is done via messages. Messages are sent to and received from ports, which are Mach's name for mailboxes.
- Ports are finite in size (they have a bounded queue).
- Ports are unidirectional. For two-way communication, you need two ports: one for sending the request and a separate reply port for the response.
- A port can have multiple senders but only one receiver.
Port Rights and Security¶
A crucial security concept in Mach is port rights. A port right is a capability or permission that a task must hold to interact with a port.
- To receive from a port, a task must hold the MACH_PORT_RIGHT_RECEIVE right.
- The task that creates a port is its owner and initially holds the receive right.
- Rights can be transferred. For example, if Task T1 (owning port P1) sends a message to Task T2 and expects a reply, T1 must grant T2 the MACH_PORT_RIGHT_SEND right for P1 so that T2 can send the reply back.
Special Ports: When a task is created, it gets two special ports:
- Task Self Port: The task can send messages to this port to communicate with the kernel (which has the receive right).
- Notify Port: The kernel sends event notifications to this port (the task has the receive right).
Creating a Port and Sending/Receiving Messages¶
The mach_port_allocate() system call is used to create a new port and allocate its message queue. The following code example creates a port:
mach_port_t port; // This is the name (an integer) for the port right
mach_port_allocate(
mach_task_self(), // A reference to the current task
MACH_PORT_RIGHT_RECEIVE, // The right to create for this port
&port); // Where to store the port name
- Port names are simple integers, similar to UNIX file descriptors.
The Bootstrap Server: Tasks also have a bootstrap port. This allows a task to register a port it created with a system-wide bootstrap server. Other tasks can then look up this port by name in the registry to obtain send rights, enabling discovery and communication between unrelated tasks.
Message Queues and Message Structure¶
- Each port has a finite-sized queue. Messages are copied into this queue when sent.
- Delivery is reliable and all messages have the same priority.
- Mach guarantees FIFO order for messages from the same sender, but not for messages from different senders (they may be interleaved).
A Mach message consists of two parts:
- A Fixed-Size Header: Contains metadata like the message size and the source and destination ports. The source port acts as a "return address" for replies.
- A Variable-Sized Body: Contains the actual data.
Messages can be of two types:
- Simple Messages: Contain ordinary, unstructured user data. The kernel does not interpret this data.
- Complex Messages: Can contain:
- Out-of-line data: Pointers to memory locations containing the actual data. This is highly efficient for large data, as it avoids the cost of copying the entire data into the message; only the pointer is sent.
- Port rights: Used to transfer capabilities to another task.
The Unified API: mach_msg()¶
Mach uses a single function for both sending and receiving messages: mach_msg(). A parameter passed to this function specifies the operation:
- MACH_SEND_MSG: To send a message.
- MACH_RCV_MSG: To receive a message.
This unified API simplifies the message-passing interface.
Code Example: Client-Server Communication¶
Go to Figure 3.18: Example program illustrating message passing in Mach.
The example shows how a client task sends a message to a server task using two ports: a client port and a server port.
Client Code (The Sender):
- Message Structure: The message is defined as a struct containing a header (mach_msg_header_t) and the data (int data).
- Construct the Header: The client fills out the message header:
  - msgh_size: The total size of the message.
  - msgh_remote_port: The destination port (the server port).
  - msgh_local_port: The source port (the client port), which the server will use as the "return address" for its reply.
- Send the Message: The client calls mach_msg() with the MACH_SEND_MSG flag. The parameters specify the message header, the send operation, the size of the outgoing message, and various options (here, no timeouts and no special notify port).
Server Code (The Receiver):
- Receive the Message: The server calls `mach_msg()` with the `MACH_RCV_MSG` flag.
- Parameters for Receiving: It specifies the `server` port as the port to receive from and provides the size of the buffer (`sizeof(message)`) where the incoming message should be stored.
The Kernel's Role:
The user-level mach_msg() function calls mach_msg_trap(), which is a system call that switches to the kernel. The kernel then calls mach_msg_overwrite_trap() to handle the actual, low-level work of moving the message.
Flexible Send Operations and Handling Full Queues¶
A key feature of Mach is the flexibility in how it handles a send() operation when the destination port's queue is full. The sender can specify its behavior via parameters to mach_msg():
- Wait indefinitely until space becomes available in the queue.
- Wait for a maximum of n milliseconds before timing out.
- Do not wait at all; return immediately with an error.
- Temporarily cache the message. This advanced option is designed for servers. The kernel takes ownership of the message and will deliver it when space is available, sending a notification back to the original sender. This allows a server to continue processing other requests even if one client's reply port is full, without being blocked.
Performance Optimization: Avoiding Copying¶
A major historical drawback of message-passing systems is performance loss from copying data from the sender's memory space to the kernel, and then again from the kernel to the receiver's memory space.
Mach uses a clever optimization to avoid this:
- It uses virtual-memory-management techniques (covered in Chapter 10). Instead of copying the data, the kernel simply remaps the memory pages containing the message from the sender's address space into the receiver's address space.
- This means both the sender and receiver are accessing the same physical memory for the message body, eliminating the costly copy operations.
Important Limitation: This high-performance technique only works for messages within a single system (intrasystem). It cannot be used for messages sent over a network to a different computer in a distributed system. This optimization is a primary reason for Mach's high-performance IPC.
3.7.3 Windows Inter-Process Communication (IPC)¶
This section explains how different programs and system components communicate within the Windows operating system. The design is modular, meaning it's built from separate, well-defined parts (modules) that talk to each other. This makes the system more powerful and easier to update.
The Client-Server Model and Subsystems¶
In Windows, application programs (like your web browser or a game) often don't talk directly to the operating system kernel. Instead, they communicate with subsystems. A subsystem is a part of the OS that provides a specific environment or set of services. For example, one subsystem might manage the graphical user interface, while another handles security.
- Application programs act as clients.
- Subsystems act as servers.
- They communicate using a message-passing mechanism. The client sends a request message to the server, and the server sends back a reply message.
The Advanced Local Procedure Call (ALPC)¶
The specific message-passing facility in Windows is called the Advanced Local Procedure Call (ALPC). It's designed for communication between two processes on the same computer.
- Think of it like this: It's similar to a Remote Procedure Call (RPC), which is a way for a program to call a function on another machine across a network. However, ALPC is a highly optimized version used only for communication within a single Windows machine.
How ALPC Establishes a Connection: Ports and Channels¶
ALPC uses objects called ports to manage connections, similar to the Mach operating system.
Connection Port: A server process creates and "publishes" a connection-port object that other processes can see. It's like the server putting up a public "Now Open for Business" sign.
Connection Request: A client process that needs a service finds this connection port and sends a connection request. It's like a customer walking up to the front desk.
Communication Channel: When the server accepts the request, it creates a private, two-way communication channel just for that client. This channel is made up of a pair of private communication ports:
- One port for sending messages from the client to the server.
- One port for sending replies from the server back to the client.
- The server gives the client a "handle" (a reference) to this new channel.
This process is illustrated in Figure 3.19. The client connects to the server's public Connection Port. Once connected, they get handles to a private, two-way channel made of Communication Ports.
ALPC Message-Passing Techniques¶
For efficiency, ALPC chooses among three different methods of passing data, depending on the size of the message.
For Small Messages (up to 256 bytes):
- Method: The message is copied directly into a queue in the port's memory.
- Analogy: Like passing a short, handwritten note to someone. It's quick and simple to copy the information.
For Larger Messages:
- Method: Uses a section object, which is a block of shared memory that both the client and server can access.
- How it works: The sender places the large message into the shared memory section. Then, it sends a very small message via the port (using method #1) that contains a pointer to where the data is located in the shared section. The receiver uses this pointer to read the data directly.
- Why it's better: This avoids the slow process of copying the entire large message from one process's memory to another's. It's like telling someone, "I've left the big report for you on the shared desk, here's exactly where to find it," instead of photocopying the entire 100-page report.
For Very Large Amounts of Data:
- Method: The server is given special permission to read from or write directly into the client process's own memory (its address space).
- Use Case: This is the most complex method and is used when the data is too large or doesn't fit well into the shared section object method.
Important Setup Note: The client must decide upfront when creating the channel if it will need to send large messages. If so, it requests that a section object (shared memory) be created for the channel.
Key Takeaway: Who Uses ALPC?¶
It is crucial to understand that application programmers do not use ALPC directly. It is not part of the standard Windows API that programmers use.
- Instead, applications use standard Remote Procedure Calls (RPC).
- When an RPC call is made to a process on the same machine, Windows internally and automatically translates that RPC into an ALPC for fast, local execution.
- Furthermore, many core Windows kernel services use ALPC to communicate with client processes. So, while you don't see it, ALPC is a fundamental mechanism making Windows run smoothly.
3.7.4 Pipes¶
A pipe is one of the most fundamental Inter-Process Communication (IPC) mechanisms. Think of it like a real-world water pipe: it acts as a conduit or a channel that allows data to flow from one process to another. Pipes were one of the very first IPC methods in early UNIX systems and remain popular because they provide a relatively simple way for processes to exchange information, though they do have some limitations.
When we talk about implementing or using a pipe, there are four key design questions we need to consider:
Four Key Issues in Pipe Implementation¶
Direction of Communication:
- Is the pipe unidirectional (data flows only in one direction, like a one-way street)?
- Or is it bidirectional (data can flow in both directions)?
Mode of Two-Way Communication:
- If the pipe is bidirectional, is it half-duplex (data can travel in both directions, but only in one direction at a time, like a walkie-talkie)?
- Or is it full-duplex (data can travel in both directions simultaneously, like a telephone call)?
Relationship Between Processes:
- Must the communicating processes have a specific relationship? For example, must one be the parent process and the other be its child process?
Network Communication:
- Can the pipe be used for communication between processes on different machines over a network?
- Or are the processes restricted to residing on the same machine?
Types of Pipes¶
In the following sections, we will explore two common types of pipes that are used on both UNIX/Linux and Windows systems. These are:
- Ordinary Pipes: The simpler, more basic form.
- Named Pipes: A more advanced and flexible version.
We will examine how each of these pipe types addresses the four key issues listed above.
3.7.4.1 Ordinary Pipes¶
This section dives into the first and simpler type of pipe: the Ordinary Pipe.
Concept and Behavior¶
Ordinary pipes work in a classic producer-consumer fashion.
- One process, the producer, writes data into one end of the pipe (the write end).
- The other process, the consumer, reads that data from the other end (the read end).
Because of this design, ordinary pipes have a critical characteristic: they are unidirectional. Data can only flow in one single direction.
- What if you need two-way communication? You must create two separate pipes. One pipe would handle communication from Process A to Process B, and the second, separate pipe would handle communication from Process B back to Process A.
Let's look at how this is implemented in UNIX and Windows.
Ordinary Pipes in UNIX Systems¶
On UNIX and Linux systems, you create an ordinary pipe using the pipe() system call.
The pipe(int fd[]) Function:
- This function creates a pipe and returns two file descriptors (integers that represent open files/I/O channels) in the `fd[]` array.
- `fd[0]` is the read end of the pipe. You use this descriptor to read data from the pipe.
- `fd[1]` is the write end of the pipe. You use this descriptor to write data into the pipe.
Since UNIX treats everything as a file, a pipe is considered a special type of file. This means you use the standard read() and write() system calls to interact with it, using the file descriptors fd[0] and fd[1].
A Key Limitation: The Process Relationship. An ordinary pipe cannot be accessed from outside the process that created it. It is private to its creator. The typical way to use it is:
- A parent process creates the pipe using `pipe()`.
- The parent then creates a child process using the `fork()` system call.
- Because the child process inherits all open file descriptors from its parent (as discussed in Section 3.3.1), it automatically has access to the same pipe.
Go to Figure 3.20: This figure illustrates the file descriptors for an ordinary pipe after a fork(). Both the parent and child have their own copies of the fd[] array, pointing to the same underlying pipe. The parent can write to fd[1], and the child can read that same data from fd[0].
Code Example Walkthrough:
Go to Figure 3.21: This C code shows the setup for a UNIX ordinary pipe.
Let's break down what the code in Figure 3.21 does:
- It includes necessary header files.
- It defines a buffer size and two constants for clarity: `READ_END` (0) and `WRITE_END` (1).
- It declares two character arrays for the messages and an integer array `fd[2]` to hold the two file descriptors for the pipe.
- It declares a variable `pid` to store the process ID returned by `fork()`.
What happens after fork()?
After the child process is created, both the parent and child run simultaneously. In this specific example, the plan is:
- The parent process will write the message "Greetings" to the pipe.
- The child process will read the message from the pipe.
A Critical Step: Closing Unused Ends. The text highlights a very important practice: both processes must close the ends of the pipe they are not using.
- The parent, which is only writing, should close its read end (`fd[0]`).
- The child, which is only reading, should close its write end (`fd[1]`).
Why is this so important?
- It ensures that the `read()` system call behaves correctly. `read()` returns `0` (indicating an end-of-file condition) only when all write ends of the pipe are closed.
- If the child did not close its write end, there would still be an open write descriptor on the pipe. Even after the parent finished writing and closed its end, the child's `read()` would not return `0`, because the child itself could still write to the pipe. The child would then wait for data that will never come, causing a potential hang.
- Closing unused ends is crucial for proper coordination and for detecting when communication is finished.
Ordinary Pipes on Windows Systems¶
This section continues the discussion of ordinary pipes by explaining their implementation in the Windows operating system, where they are known as Anonymous Pipes.
Concept and Behavior in Windows¶
Windows anonymous pipes are conceptually very similar to UNIX ordinary pipes:
- They are unidirectional (one-way communication).
- They require a parent-child relationship between the communicating processes.
- Data is written using a standard write function (`WriteFile()`) and read using a standard read function (`ReadFile()`).
The Windows API: CreatePipe()¶
The key function for creating an anonymous pipe in Windows is CreatePipe(). This function is passed four parameters that define the pipe's properties:
- `hReadPipe`: A handle (a Windows reference to an object) for reading from the pipe.
- `hWritePipe`: A handle for writing to the pipe.
- `lpPipeAttributes`: A pointer to a `SECURITY_ATTRIBUTES` structure. This is crucial because it allows the programmer to specify whether the pipe handles can be inherited by a child process.
- `nSize`: The suggested size, or buffer capacity, of the pipe (in bytes). The system may use this as a guideline but is not required to create a pipe of exactly this size.
Key Difference: Explicit Handle Inheritance¶
A major difference from UNIX is that Windows does not automatically inherit all open handles/file descriptors.
- In UNIX, after a `fork()`, the child automatically gets a copy of the parent's file descriptors, including the pipe ends.
- In Windows, the programmer must explicitly specify which handles the child process is allowed to inherit. This is a deliberate, security-conscious design.
How is this done?
- The programmer initializes a `SECURITY_ATTRIBUTES` structure and sets `bInheritHandle = TRUE`.
- This structure is then passed to the `CreatePipe()` function.
- Furthermore, the parent process must redirect the child's standard input or output to the pipe handle. In the example from Figure 3.23, the parent wants the child to read from the pipe, so it redirects the child's standard input (`STD_INPUT_HANDLE`) to the read handle of the pipe.
- Because the pipe is half-duplex (unidirectional), the parent must also ensure the child does not inherit the write end of the pipe. This prevents the child from accidentally writing to the same pipe it is reading from, which could cause confusion or deadlock.
Code Example Walkthrough¶
Go to Figure 3.23: This C code shows the initial setup for a Windows anonymous pipe in the parent process.
Let's break down what the code in Figure 3.23 does:
- It includes the Windows header file `<windows.h>`.
- It declares two `HANDLE` variables, `ReadHandle` and `WriteHandle`. These will be the handles to the two ends of the pipe.
- It declares a `STARTUPINFO` structure (`si`) and a `PROCESS_INFORMATION` structure (`pi`). Windows requires these to create and manage the new child process.
- It defines the message buffer and a `DWORD` variable `written` to track how many bytes were written.
Go to Figure 3.22 (for context): While Figure 3.22 shows the UNIX code completion, the logical flow for Windows is analogous but uses different API calls. After the setup in Figure 3.23, the parent process would:
- Call `CreatePipe()` with the appropriate security attributes to allow handle inheritance.
- Use `CreateProcess()` to launch the child process (similar to the program in Figure 3.10), setting the `bInheritHandles` parameter to `TRUE`.
- Close its unused end of the pipe (in this case, the read end) before writing to the pipe using `WriteFile(WriteHandle, ...)`.
The Child Process (Figure 3.25):
- The child process does not receive the pipe handle as a direct parameter. Instead, because the parent redirected the child's standard input, the child can obtain the read handle by calling `GetStdHandle(STD_INPUT_HANDLE)`.
- It can then read from this handle using the standard `ReadFile()` function.
Summary of Limitations for Ordinary Pipes¶
The text concludes by reiterating the core limitations of ordinary (anonymous) pipes, which apply to both UNIX and Windows systems:
- Relationship Requirement: They require a parent-child relationship between the communicating processes. You cannot use an ordinary pipe to connect two unrelated processes that started independently.
- Machine Scope: Because of this relationship requirement, ordinary pipes can only be used for communication between processes on the same machine. They cannot be used for network communication.
3.7.4.2 Named Pipes¶
This section introduces a more powerful and flexible IPC mechanism: Named Pipes.
Limitations of Ordinary Pipes Recap¶
First, let's recall the shortcomings of ordinary pipes that named pipes are designed to overcome:
- They are temporary. The pipe only exists while the communicating processes are running.
- They require a parent-child relationship between the processes.
- They are unidirectional.
Advantages of Named Pipes¶
Named pipes provide a much more powerful and flexible communication tool. Their key features are:
- Persistence: A named pipe continues to exist in the system after the communicating processes have finished. It must be explicitly deleted, like a file.
- No Parent-Child Relationship: Processes do not need to be related. Any process that knows the name of the pipe can use it to communicate.
- Multiple Writers: Several processes can use the same named pipe for communication. In a typical scenario, a named pipe can have several writers.
- Bidirectional Communication Potential: While the specifics vary by OS, named pipes can support bidirectional communication.
Both UNIX and Windows systems support named pipes, but their implementations differ significantly.
Named Pipes in UNIX (FIFOs)¶
In UNIX and Linux systems, named pipes are called FIFOs (First-In, First-Out).
Creation and Management:
- A FIFO is created using the `mkfifo()` system call.
- Once created, it appears as a special type of file in the file system (you can see it with the `ls` command).
- It is manipulated using the standard file operations: `open()`, `read()`, `write()`, and `close()`.
- It is persistent: it remains in the file system until it is explicitly deleted using the `unlink()` system call, just like a regular file.
Communication Characteristics:
- Bidirectional? FIFOs allow bidirectional communication in the sense that a process can open a FIFO for both reading and writing. However, the transmission is typically half-duplex. This means data can travel in both directions, but not at the same time.
- Solution for Full-Duplex: If you need simultaneous two-way communication, you must create two FIFOs (e.g., one called `fifo_AtoB` and another called `fifo_BtoA`).
- Network Communication? No. The communicating processes must reside on the same machine.
- Data Type: UNIX FIFOs can only transmit byte-oriented data (a continuous stream of bytes).
Named Pipes in Windows Systems¶
Named pipes on Windows provide a richer and more feature-complete communication mechanism than UNIX FIFOs.
Creation and Connection:
- A server creates a named pipe using the `CreateNamedPipe()` function.
- On the server side, `ConnectNamedPipe()` waits for a client to connect; the client opens the existing named pipe using `CreateFile()`.
- Communication is performed using the standard `ReadFile()` and `WriteFile()` functions.
Advanced Communication Characteristics:
- Bidirectional? Yes. Windows named pipes support full-duplex communication. Data can travel in both directions simultaneously over the same pipe.
- Network Communication? Yes. This is a major advantage. The communicating processes may reside on either the same machine or different machines across a network.
- Data Type: Windows systems are more flexible. They allow the transmission of either byte-oriented data (a stream of bytes) or message-oriented data (discrete messages with boundaries preserved). This is specified when the pipe is created.
- Multiple Clients: A single Windows named pipe server can handle connections from multiple client processes.
Summary Table: Ordinary Pipes vs. Named Pipes¶
| Feature | Ordinary Pipes (UNIX & Windows) | Named Pipes (UNIX FIFOs) | Named Pipes (Windows) |
|---|---|---|---|
| Persistence | Temporary (lasts only while processes run) | Persistent (exists until deleted) | Persistent (exists until deleted) |
| Process Relationship | Required (Parent-Child) | Not Required | Not Required |
| Direction | Unidirectional | Half-Duplex (typically) | Full-Duplex |
| Network Use | No (same machine only) | No (same machine only) | Yes (same or different machines) |
| Data Type | Byte-oriented | Byte-oriented | Byte- or Message-oriented |
3.8 Communication in Client–Server Systems¶
This section expands on the topic of Inter-Process Communication (IPC) by focusing specifically on techniques used in client–server systems. As a reminder from Section 1.10.3, this is a common architecture where a server process provides a service, and a client process requests that service.
Recap and Introduction¶
Previously, in Section 3.4, we learned about two fundamental IPC techniques:
- Shared Memory
- Message Passing
Both of these techniques can be, and are, used to facilitate communication in client-server systems. In this section, we will explore two other, highly important strategies designed for this purpose:
- Sockets
- Remote Procedure Calls (RPCs)
The text also notes that RPCs are not only useful for classic network communication but are also used by the Android operating system as a form of IPC between processes on the same device.
PIPES IN PRACTICE¶
Before diving into sockets and RPCs, the text provides a practical, real-world example of how pipes are used every day in command-line environments.
The UNIX/Linux Example¶
In UNIX, pipes are frequently used on the command line to chain commands together, where the output of one command becomes the input for the next. This is a direct application of the producer-consumer model.
- Scenario: The `ls` command lists all files in a directory. If there are many files, the output scrolls by too fast to read. The `less` command is a "pager" that displays output one screen at a time, allowing you to scroll up and down.
- The Solution: You can connect these two commands using a pipe.
- The Command: `ls | less`
- How it works:
  - The shell creates an ordinary pipe.
  - It then forks two processes: one for `ls` and one for `less`.
  - The standard output (stdout) of the `ls` process (the producer) is connected to the write end of the pipe.
  - The standard input (stdin) of the `less` process (the consumer) is connected to the read end of the pipe.
  - The `ls` command produces the directory listing and writes it into the pipe. The `less` command reads this data from the pipe and displays it interactively.
- The pipe symbol on the command line is the vertical bar: `|`.
The Windows DOS Shell Example¶
The same concept applies to the Windows command prompt (DOS shell).
- Scenario: The `dir` command produces a directory listing. The `more` command provides functionality similar to UNIX's `less`, pausing output after each screenful.
- The Command: `dir | more`
- How it works: The mechanism is identical to the UNIX example. The shell creates an anonymous pipe, connecting the stdout of `dir` to the stdin of `more`.
- A Note on Naming: UNIX also has a `more` command, but a command named `less` was created that offers more features (like scrolling backwards). The joke is that "less is more": the `less` command provides more functionality than the `more` command.
3.8.1 Sockets¶
This section explains one of the most fundamental networking concepts: Sockets.
What is a Socket?¶
A socket is defined as an endpoint for communication. When two processes need to communicate over a network, each process uses a socket. Therefore, a connection consists of a pair of sockets—one for each process involved.
A socket is uniquely identified by combining two pieces of information:
- IP Address: The network address of the computer (e.g., `146.86.5.20`).
- Port Number: A number that specifies a particular service or application on that computer (e.g., `80` for a web server).
The full socket address is written as IP-Address:Port-Number (e.g., 146.86.5.20:1625).
The Client-Server Architecture and Ports¶
Sockets operate on a standard client-server model:
- The server process "listens" for incoming connection requests on a specific, well-known port number. This is like a business having a public phone number.
- Well-known ports (numbers less than 1024) are reserved for standard services. Examples include:
- Port 22: SSH
- Port 21: FTP
- Port 80: HTTP (Web Server)
- The client process initiates a connection request. When it does, its host computer assigns it a port number. This client port is an ephemeral port, which is an arbitrary number greater than 1024. It's like the client using a private, temporary phone line to make the call.
Establishing a Connection: A Detailed Example¶
Let's walk through the example from the text:
- The Server: A web server is running on a machine with IP address `161.25.19.8`. It is listening for connections on its well-known port, port 80. Its socket is `(161.25.19.8:80)`.
- The Client: A client process on a machine with IP address `146.86.5.20` wants to connect to that web server.
- The Connection: The client's host machine assigns it an available ephemeral port, say port 1625. The client's socket is now `(146.86.5.20:1625)`.
- The Unique Pair: The connection between the client and the server is defined by the unique pair of sockets `(146.86.5.20:1625)` and `(161.25.19.8:80)`.
Go to Figure 3.26: This figure illustrates this exact scenario. It shows host X (146.86.5.20) with its client socket on port 1625, communicating with the web server (161.25.19.8) on its server socket, port 80. All network packets are delivered to the correct process based on the destination port number.
Connection Uniqueness: If another process on the same client host (146.86.5.20) also wanted to connect to the same web server (161.25.19.8:80), it would be assigned a different ephemeral port (e.g., 1626). This ensures that every connection on the network is a unique socket pair.
Socket Programming in Java¶
The text notes that while many examples in the book use C, it will use Java for sockets because Java provides a cleaner, easier-to-understand interface.
Java provides three main socket types:
- `Socket` (connection-oriented): Uses the TCP protocol. Reliable; guarantees delivery and ordering of data.
- `DatagramSocket` (connectionless): Uses the UDP protocol. Unreliable but faster; no guarantee of delivery or order.
- `MulticastSocket` (a subclass of `DatagramSocket`): Allows data to be sent to multiple recipients at once.
Example: A Date Server using TCP Sockets¶
The text provides a complete example of a simple date server using a connection-oriented TCP socket. The server listens on port 6013 and, when a client connects, it sends the current date and time and then closes the connection.
Go to Figure 3.27: This is the Java code for the Date Server.
Let's break down how this server works, step-by-step:
- Import Libraries: The code imports `java.net.*` (for networking) and `java.io.*` (for input/output).
- Create a Server Socket: The line `ServerSocket sock = new ServerSocket(6013);` creates a `ServerSocket` object bound to port 6013. This is the socket that will listen for incoming client connections.
- Listen for Connections (The Main Loop): The server enters an infinite `while (true)` loop to continuously handle clients.
- Accept a Client Connection: The line `Socket client = sock.accept();` is crucial. The `accept()` method blocks, meaning the server process waits here and does nothing else, until a client requests a connection. When a connection request arrives, `accept()` returns a new `Socket` object named `client`. This `client` socket is the dedicated communication channel to that specific client.
- Establish an Output Stream: The server creates a `PrintWriter` object named `pout` linked to the output stream of the `client` socket (`client.getOutputStream()`). The `true` parameter enables auto-flushing, so data is sent immediately instead of waiting for a buffer to fill. This `PrintWriter` allows the server to use the familiar `print()` and `println()` methods to send data to the client.
- Send the Date: The server gets the current date and time (`new java.util.Date().toString()`) and sends it to the client by calling `pout.println()`.
- Close the Client Connection: The server calls `client.close()`, which terminates the connection with this specific client.
- Repeat: The loop continues, and the server goes back to the `accept()` method to wait for the next client connection.
This example clearly illustrates the lifecycle of a simple server: Create, Listen, Accept, Process, Close, Repeat.
The Date Client and Socket Communication Analysis¶
This section completes the socket example by presenting the client's code and then discusses the overall characteristics of socket-based communication.
The Date Client¶
For the date server to be useful, we need a client program to connect to it and display the date it sends. The client's role is to initiate the connection and read the data.
Go to Figure 3.28: This is the Java code for the Date Client.
Let's break down how this client works, step-by-step:
- Import Libraries: The client imports the same `java.net.*` and `java.io.*` libraries as the server for networking and I/O.
- Create and Connect a Socket: The critical line is `Socket sock = new Socket("127.0.0.1", 6013);`. This single line does two things:
  - It creates a client-side `Socket` object.
  - It immediately attempts to establish a connection to the server located at IP address `127.0.0.1` on port `6013`.
- The Loopback Address (`127.0.0.1`): This is a special IP address known as the loopback address. When a computer uses this address, it is communicating with itself, which allows you to run both the client and server on the same machine for testing. The address could be replaced with:
  - The actual IP address of another machine on the network running the server.
  - A hostname (like `www.westminstercollege.edu`), which the system automatically resolves to an IP address.
- Establish an Input Stream: Once connected, the client gets an `InputStream` from the socket (`sock.getInputStream()`) and wraps it in a `BufferedReader`. This `BufferedReader` (named `bin`) allows the client to read data from the server line by line using the `readLine()` method, a convenient high-level operation.
- Read from the Server: The client enters a `while` loop, reading each line of text sent by the server. In this simple case, the server only sends one line (the date string). The loop reads this line and prints it to the console (`System.out.println(line)`).
- Close the Connection: After receiving and printing the data, the client closes its socket (`sock.close()`) and exits.
Analysis of Socket Communication¶
The text concludes the discussion on sockets by evaluating their place in distributed programming.
- Common and Efficient: Socket communication is widely used and is a performant method for network communication.
- Low-Level Form of Communication: Despite their usefulness, sockets are considered a low-level mechanism. This is because they only provide a channel for an unstructured stream of bytes to be exchanged.
The Problem with Unstructured Bytes: It is the sole responsibility of the application programmer (both the client and server developer) to impose a structure and meaning on this raw byte stream. For example:
- How does the server know how long the client's request message is?
- How does the client know how much data the server is going to send back?
- If multiple pieces of data are sent, how are they separated?
The programmer must design and implement a custom application-level protocol to handle these issues, which can be complex and error-prone.
Conclusion: Because of this low-level, unstructured nature, the text now introduces a higher-level method of communication that abstracts away these complexities: Remote Procedure Calls (RPCs).
3.8.2 Remote Procedure Calls (RPC)¶
This section introduces Remote Procedure Calls (RPC), a high-level communication paradigm designed to make network communication feel as simple as calling a local function.
The RPC Paradigm¶
RPC is one of the most common forms of remote service. Its primary goal is to abstract the complex details of network communication. It makes a procedure (or function) call on a remote machine look and feel, to the programmer, just like a procedure call on the local machine.
- Foundation: It is conceptually similar to the message-passing IPC mechanism from Section 3.4 and is often built on top of such a system.
- Key Difference: Because the processes are on separate systems, it must use a message-based communication scheme over the network.
Structured Messages and the RPC Daemon¶
Unlike the unstructured byte streams of sockets, RPC communication uses well-structured messages.
- Addressing: Each message is addressed to an RPC daemon—a special server process—listening on a specific port on the remote system.
- Message Content: The message doesn't just contain raw data; it contains a structured request specifying:
- An identifier for the specific function to execute.
- The parameters to pass to that function.
- Execution: The RPC daemon receives the message, executes the specified function with the given parameters, and sends back any output in a separate reply message.
Ports Revisited: In this context, a port is a number within a system's network address that differentiates between its many network services. For example, a service for listing current users might be attached to port 3027. A client sends an RPC request to that port to get the user list.
The Client-Server Stubs and Marshaling¶
The "magic" that makes a remote call look local is achieved using stubs on both the client and server sides.
Refer to Figure 3.29: This figure illustrates the following sequence of steps.
Client Stub: When a client program calls what it thinks is a local procedure, it is actually calling a client-side stub. This stub is a local function that represents the remote procedure.
- The client stub's job is to marshal the parameters. This means it takes the procedure's parameters and packages them into a well-structured network message.
- It then uses message passing (e.g., sockets) to send this message to the server.
Server Stub: On the server side, a server-side stub is waiting for incoming requests.
- It receives the network message.
- It unmarshals the parameters, extracting them from the message.
- It then calls the actual server procedure on the server's machine.
Returning the Result: After the server procedure finishes, the return values are sent back to the client by reversing the process: the server stub marshals the return values, and the client stub unmarshals them and returns them to the client program.
Microsoft Interface Definition Language (MIDL): On Windows systems, this stub code is not written by hand. Instead, programmers write a specification in MIDL, which is then compiled to automatically generate the client and server stubs.
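The stub mechanism above can be sketched in plain Java. This is only an illustration: the function identifier, the message layout, and the "remote" procedure body are invented here, whereas real RPC systems generate this code from an interface definition. Note that DataOutputStream writes multi-byte values in big-endian order, a fixed machine-independent representation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class StubSketch {
    static final int FUNC_MULTIPLY = 1; // invented function identifier

    // Client stub: marshal the function id and parameters into a message.
    static byte[] marshal(int functionId, int x, double y) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(functionId); // which remote procedure to run
        out.writeInt(x);
        out.writeDouble(y);
        return buf.toByteArray(); // this is what would travel over the socket
    }

    // Server stub: unmarshal the parameters and call the real procedure.
    static double unmarshalAndDispatch(byte[] message) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(message));
        int functionId = in.readInt();
        int x = in.readInt();
        double y = in.readDouble();
        if (functionId == FUNC_MULTIPLY) return x * y; // the actual server procedure
        throw new IllegalArgumentException("unknown function " + functionId);
    }

    public static void main(String[] args) throws IOException {
        byte[] wire = marshal(FUNC_MULTIPLY, 3, 0.5);
        System.out.println(unmarshalAndDispatch(wire)); // prints 1.5
    }
}
```

Returning the result to the client would simply reverse the process: marshal the return value into a reply message and unmarshal it on the client side.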
Key Technical Challenges in RPC¶
RPC systems must solve several difficult problems introduced by the network:
1. Parameter Marshaling and Data Representation:
- Problem: Different machines can represent data differently (e.g., big-endian vs. little-endian byte order for integers).
- Solution: The RPC system defines a machine-independent data representation, such as External Data Representation (XDR). The client stub converts local data into XDR before sending; the server stub converts XDR into its local format upon receipt.
2. Call Semantics: Local calls are "exactly once." Due to network errors, RPCs can fail or be duplicated. RPC systems must define their semantics:
- At Most Once: The system ensures that even if multiple duplicate requests are sent, the procedure is executed no more than once. This is often implemented by having the server keep a history of processed requests (e.g., using timestamps) and ignoring duplicates.
- Exactly Once: This is the ideal, but hardest to achieve. It requires the "at most once" mechanism plus a system of acknowledgements (ACKs). The client must resend the request until it receives an ACK from the server confirming the call was received and executed.
3. Binding (Finding the Server): How does a client know which server port to connect to?
- Fixed Port Addresses: The port number is hardcoded at compile time. This is simple but inflexible.
- Dynamic Binding (Rendezvous): A more flexible approach uses a rendezvous daemon (or matchmaker) on a well-known port. The client first asks this daemon, "Where can I find service X?" The daemon replies with the port number of the correct RPC server. The client then communicates directly with that server.
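The "at most once" semantics described above can be sketched with a request-id history. This is a simplified illustration under invented names: real servers also bound the size of the history (e.g., with timestamps) and combine this with ACKs to approach "exactly once".

```java
import java.util.HashMap;
import java.util.Map;

public class AtMostOnceServer {
    // History of request ids already executed, with their cached replies.
    private final Map<Long, String> history = new HashMap<>();
    private int executions = 0;

    // Execute a request at most once; duplicates get the cached reply back.
    String handle(long requestId, String request) {
        if (history.containsKey(requestId)) {
            return history.get(requestId);     // duplicate: do NOT re-execute
        }
        executions++;                          // the real procedure runs here
        String reply = "reply-to:" + request;
        history.put(requestId, reply);
        return reply;
    }

    int executionCount() { return executions; }

    public static void main(String[] args) {
        AtMostOnceServer server = new AtMostOnceServer();
        server.handle(42, "list-users");
        server.handle(42, "list-users");       // retransmitted duplicate
        System.out.println(server.executionCount()); // prints 1
    }
}
```

Even though the request arrives twice, the procedure body runs only once and both calls see the same reply.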
Application: Distributed File Systems (DFS)¶
RPC is extremely useful for building distributed systems. A prime example is a Distributed File System (DFS), covered in detail in Chapter 19.
- The DFS can be implemented as a set of RPC daemons.
- A client sends an RPC message to the server's DFS port. The message contains a file operation to perform, such as read(), write(), or delete().
- The DFS daemon on the server executes the operation on its local file system and sends back the result (e.g., the contents of a file) in the return message.
- A request might be for a single block of a file, requiring multiple RPCs to transfer an entire file.
3.8.2.1 Android RPC¶
This section explains how the Remote Procedure Call (RPC) paradigm is adapted for use within a single device by the Android operating system.
RPC as a Form of IPC¶
While RPCs are typically used for communication between different machines in a network, they are also a powerful tool for Inter-Process Communication (IPC) between processes on the same system. Android uses this concept extensively in its Binder framework, which is its core IPC system.
Android Application Components and Services¶
To understand Android RPC, you first need to know about Android's building blocks:
- Application Component: A basic building block of an Android app (like an Activity, Broadcast Receiver, or Service).
- Service: An application component that runs in the background, with no user interface, to perform long-running operations or work for other processes. Examples include playing music or fetching data from the network without blocking the main app.
- Binding a Service: A client app can connect to, or "bind", a service by calling the bindService() method. Once bound, the client and service can establish a client-server communication channel using either message passing or RPCs.
Two IPC Methods in Android Services¶
A bound service must extend the Android Service class and implement the onBind() method. This method is called when a client binds to it and determines the IPC method.
1. Message Passing with Messenger:
- The onBind() method returns a Messenger object.
- This allows the client to send one-way messages to the service.
- For two-way communication, the client must include its own Messenger in the replyTo field of the message it sends. The service can then use this client Messenger to send messages back.
2. Remote Procedure Calls (RPC) with AIDL: This is the more powerful and structured method, which allows the client to call methods on the service as if it were a local object.
Implementing RPC with AIDL¶
To provide RPCs, the service must return an interface for the remote object. This is done using the Android Interface Definition Language (AIDL).
Step 1: Define the AIDL Interface
The programmer creates a file (e.g., RemoteService.aidl) that defines the interface for the remote service using Java-like syntax.
```aidl
/* RemoteService.aidl */
interface RemoteService {
    boolean remoteMethod(int x, double y);
}
```
This interface declares that there is a remote method called remoteMethod that takes an int and a double as parameters and returns a boolean.
Step 2: Automatic Stub Generation
The Android SDK automatically uses this .aidl file to generate the necessary Java code, including the client and server stubs. This is similar to how MIDL works on Windows. The programmer does not have to write the network communication or marshaling code.
Step 3: Server Implementation
The server (the Service) must provide a concrete implementation of the interface generated from the .aidl file. This implementation contains the actual logic that runs when remoteMethod() is called.
Step 4: Client Invocation
When a client calls bindService(), the service's onBind() method is triggered. It returns the generated stub object to the client. The client can then use this stub to call the remote method directly:
```java
RemoteService service; // This is actually the stub object
...
boolean result = service.remoteMethod(3, 0.14); // Calls the remote service
```
The Role of the Binder Framework¶
Internally, the Android Binder framework handles all the complex, low-level work transparently:
- Parameter Marshaling: It converts the parameters (3 and 0.14) into a format suitable for cross-process transmission.
- Inter-Process Transfer: It passes the marshaled data from the client process to the server process.
- Method Invocation: It invokes the correct method (remoteMethod) on the server's implementation object.
- Return Value: It marshals the return value (result) and sends it back to the client process.
In summary, Android RPC provides a high-level, programmer-friendly way for apps to communicate and share functionality securely, abstracting away all the underlying complexity of process boundaries and data marshaling.
3.9 Summary¶
This chapter covered the fundamental concept of a process and how processes communicate with each other. Here is a concise summary of the key points.
1. The Process Concept¶
- A process is a program in execution.
- Its current activity is tracked by the program counter (PC) and the contents of the CPU registers.
- A process in memory is divided into four sections:
- Text: The executable code.
- Data: Global variables.
- Heap: Dynamically allocated memory during runtime.
- Stack: Temporary data (function parameters, return addresses, local variables).
2. Process State and Management¶
- A process changes state as it executes. The general states are:
- Ready: Waiting to be assigned to a CPU.
- Running: Instructions are being executed.
- Waiting: Waiting for some event (like I/O completion).
- Terminated: Finished execution.
- The operating system represents each process with a Process Control Block (PCB), a kernel data structure containing all information about a process (PID, state, PC, etc.).
- The process scheduler selects from among the ready processes which one to run next on the CPU.
- A context switch is the operation where the kernel saves the state of the currently running process and restores the state of a new process to run.
3. Process Creation¶
- Processes are created using system calls:
- UNIX/Linux: the fork() system call.
- Windows: the CreateProcess() function.
4. Interprocess Communication (IPC)¶
Processes can communicate using two main models:
A. Shared Memory
- Two or more processes share the same region of memory.
- POSIX provides a standard API for setting up shared memory.
B. Message Passing
- Processes communicate by exchanging messages.
- Mach and Windows use message passing as a primary IPC mechanism.
- Windows uses the Advanced Local Procedure Call (ALPC) facility for local message passing.
5. Pipes¶
A pipe is a conduit for two processes to communicate. There are two types:
A. Ordinary Pipes
- Designed for communication between processes with a parent-child relationship.
- UNIX: Created with pipe(). Has a read end (fd[0]) and a write end (fd[1]).
- Windows: Called Anonymous Pipes. Created with CreatePipe(). Unidirectional and require a parent-child relationship.
B. Named Pipes
- More general; allow communication between unrelated processes.
- UNIX: Called FIFOs. Created with mkfifo(). Appear as files in the filesystem. Typically half-duplex.
- Windows: Named Pipes. Richer than FIFOs; support full-duplex communication and can be used over a network.
6. Client-Server Communication¶
A. Sockets
- An endpoint for communication, defined by an IP address and a port number.
- Allow two processes on different machines to communicate over a network.
- Considered a low-level mechanism as it only provides an unstructured byte stream.
B. Remote Procedure Calls (RPCs)
- A high-level abstraction that makes a procedure call on a remote machine appear as a local call.
- Uses stubs on the client and server to handle marshaling (converting parameters to a network format) and communication.
- Android RPC: The Android OS uses a form of RPC (via its Binder framework and AIDL) for IPC between processes on the same device, allowing one process to invoke services in another.
Chapter 4: Introduction to Threads and Concurrency¶
1. The Shift from Single-Threaded to Multi-Threaded Processes¶
In the previous chapter (Chapter 3), a process was defined as a program in execution. A key part of that definition was that it had a single thread of control. Think of this as a single pointer (like the Program Counter in your computer architecture knowledge) moving sequentially through the instructions of the program. Even if the program was doing multiple things, it was doing them one after another.
However, this model is limited. Modern operating systems allow a single process to contain multiple threads of control. Each thread is like a separate, lightweight agent within the process that can execute code independently.
2. Why Threads? The Drive for Parallelism¶
The primary motivation for using threads is to exploit parallelism. This has become critically important with the advent of multicore systems (CPUs with multiple processing cores).
- Analogy: Imagine a kitchen with one chef (a single-threaded process). The chef must chop vegetables, boil water, and season the meat one task at a time. Now, imagine a kitchen with multiple chefs (multiple threads within one process). They can work on different tasks simultaneously, dramatically speeding up the overall cooking process.
- Technical Connection: Your computer architecture course taught you about multicore processors. Threads are the software mechanism that allows you to actually use those multiple cores effectively within a single application. Without threads, a process could only ever run on one CPU core at a time, wasting the potential of the others.
3. What This Chapter Will Cover¶
This chapter serves as a roadmap. It will introduce you to:
- The Concepts and Challenges: Using multiple threads isn't simple. We will discuss the new problems that arise, such as coordination and data sharing between threads.
- Threading APIs: We will look at the specific programming interfaces used to create and manage threads, focusing on three major libraries:
- Pthreads: The standard threading API for UNIX/Linux-like systems.
- Windows Threads: The threading API for the Windows operating system.
- Java Threads: How threading is implemented in the Java programming language, which manages threads through its Virtual Machine.
- Higher-Level Abstractions: We will explore modern features that hide the complex details of thread management from the programmer. The goal here is to let developers focus on identifying what can be done in parallel, while the system's frameworks handle the "how" of creating and managing the threads.
- Operating System Design Impact: We will examine how the existence of threads influences the design of the operating system kernel itself.
- Kernel-Level Thread Support: Finally, we will take a close look at how two real-world operating systems, Windows and Linux, implement and support threads within their kernels.
This chapter lays the foundation for understanding how modern software can efficiently utilize modern hardware.
4.1 Overview: What is a Thread?¶
Think of a thread as a lightweight, streamlined version of a process. You already know from computer architecture that a process requires a lot of overhead: it has its own dedicated memory space (code, data, heap, stack), file descriptors, and more. A thread is a way to break a process down into smaller, executable units.
The Technical Definition: A thread is a basic unit of CPU utilization. Its core components, which you are familiar with from the CPU's perspective, are:
- Thread ID: A unique identifier.
- Program Counter (PC): Tracks which instruction to execute next.
- Register Set: Holds the current working variables of the thread.
- Stack: Contains the thread's execution history (function calls, local variables).
The Key Difference from a Process: While a thread has these independent components, it shares everything else with other threads belonging to the same process. This includes:
- The code section (the program instructions).
- The data section (global variables).
- Other OS resources like open files and signals.
This makes threads "lightweight" because creating a new thread doesn't require duplicating all of this shared context, unlike creating a new process.
Single-threaded vs. Multithreaded Processes:
- A traditional, single-threaded process has just one "thread of control." One program counter, one sequence of instructions.
- A multithreaded process has multiple threads, meaning it can have multiple program counters and multiple sequences of instructions executing at the same time (or seemingly at the same time on a single core).
Refer to Figure 4.1: This figure perfectly illustrates the difference. The single-threaded process has one set of registers, one stack, and one PC. The multithreaded process has multiple sets of registers, multiple stacks, and multiple PCs, all operating within a single shared code and data space.
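The sharing described above can be seen in a minimal Java sketch (names invented for illustration): both threads update the same static field, which lives in the shared data/heap area, while each thread's loop counter is a local variable on that thread's own private stack.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SharedVsPrivate {
    // Shared: visible to every thread in the process.
    static final AtomicInteger shared = new AtomicInteger(0);

    static int runTwoThreads() throws InterruptedException {
        Runnable work = () -> {
            int local = 0;                 // private: on this thread's own stack
            for (int i = 0; i < 1000; i++) {
                local++;
                shared.incrementAndGet();  // both threads update the same object
            }
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();              // wait for both threads to finish
        return shared.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runTwoThreads()); // prints 2000
    }
}
```

AtomicInteger is used so the two threads' updates don't race; coordinating such shared updates is exactly the challenge taken up in later chapters.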
4.1.1 Motivation: Why Do We Need Threads?¶
Virtually all modern applications use threads. The motivation comes from two main areas: improving application responsiveness and efficiently using modern hardware.
1. Improved Application Structure and Responsiveness: Even on a single-core CPU, threads can make an application feel more responsive by allowing it to perform multiple tasks concurrently. The OS rapidly switches between threads, giving the user the illusion that many things are happening at once.
Real-world examples from your text:
- Photo Thumbnail Generator: A separate thread can be spawned to process each image. This is more efficient than processing one image completely before starting the next.
- Web Browser: One thread can handle the UI (displaying content) while another thread performs network communication (downloading data). This prevents the entire browser from freezing while waiting for a slow network.
- Word Processor: One thread for user input, another for display, and a third for background spell-checking. This allows you to keep typing while the program checks your spelling.
2. Leveraging Multicore Architectures: You learned in computer architecture that modern CPUs have multiple cores. A single-threaded process can only run on one core at a time, leaving the others idle. A multithreaded application can be designed to run its threads on multiple cores simultaneously, leading to a true performance increase for CPU-intensive tasks like data mining, scientific computing, and graphics rendering.
3. The Server Example: Economic Resource Usage This is a critical concept. Let's analyze the web server example in detail.
- The Problem: A busy web server gets thousands of requests. A single-threaded server can only handle one request at a time. Everyone else waits.
- The Old Solution: Process-per-request. The server would create a brand new process for each incoming request. As you know, process creation is heavyweight—it involves duplicating the entire parent's memory space, which is slow and consumes a lot of RAM.
- The Modern Solution: Thread-per-request. The server runs as a single, multithreaded process. One main thread listens for connections. When a request comes in, the server creates a new thread to handle it. Thread creation is lightweight because the new thread shares the existing process's code, data, and files. It only needs its own stack and registers. This is vastly more efficient and scalable.
Refer to Figure 4.2: This diagram shows the server architecture. Step 1: A request comes in. Step 2: The server creates a new thread inside its own process to service that request. Step 3: The main thread immediately goes back to listening for more requests.
4. Operating System Kernels:
Even the OS itself uses threads. The Linux kernel, for example, starts threads like kthreadd (you can see it using ps -ef) to manage specific tasks like handling devices or memory. This makes the kernel more modular and responsive.
In Summary: Threads provide a powerful tool for programmers to design applications that are more responsive, resource-efficient, and capable of exploiting the parallel processing power of multicore systems.
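The thread-per-request server from the example above can be sketched without any networking (a hedged illustration: requests are simulated as loop iterations, and the per-request "work" is just a counter increment):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class ThreadPerRequest {
    // Simulated server: a "listener" loop spawns one worker thread per request.
    static int handleAll(int requests) throws InterruptedException {
        AtomicInteger handled = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(requests);
        for (int id = 0; id < requests; id++) {
            new Thread(() -> {             // lightweight: shares code/data/files
                handled.incrementAndGet(); // the per-request work would go here
                done.countDown();
            }).start();
            // the listener immediately returns to accepting the next request
        }
        done.await();                      // wait for all workers to finish
        return handled.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("handled " + handleAll(5) + " requests");
    }
}
```

Each new thread needs only its own stack and registers, which is why this scales far better than creating a whole process per request.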
4.1.2 Benefits of Multithreaded Programming¶
Using multiple threads within a single program provides several major advantages. These benefits are often categorized into four key areas, which build directly on the concepts you learned in computer architecture about processes and context switching.
1. Responsiveness¶
What it means: Multithreading can keep an application responsive and interactive even when parts of it are busy with long-running tasks.
Detailed Explanation: In a single-threaded application, the entire program has only one thread of control. If that one thread becomes busy (e.g., performing a complex calculation, reading a large file from a slow disk, or waiting for a network response), the entire program freezes. The user interface (UI) will not update and will not respond to any user input.
In a multithreaded design, the time-consuming task is assigned to a background thread (also called a worker thread). The main thread, which is responsible for the UI, remains free to continuously listen for and react to user actions like clicks and keystrokes. This gives the user immediate feedback, such as showing a progress bar or allowing them to cancel the operation, making the application feel much smoother and more responsive.
- Example from text: When a user clicks a button that triggers a long operation, a separate thread handles the operation, so the UI doesn't lock up.
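A worker-thread sketch of this idea (hedged: there is no real UI here, and the "long operation" is just a loop; in a GUI the main thread would keep processing events instead of calling join() immediately):

```java
import java.util.concurrent.atomic.AtomicReference;

public class WorkerThreadDemo {
    static String runLongTaskInBackground() throws InterruptedException {
        AtomicReference<String> outcome = new AtomicReference<>();
        Thread worker = new Thread(() -> {
            long sum = 0;                       // stand-in for a slow operation
            for (int i = 1; i <= 1_000_000; i++) sum += i;
            outcome.set("finished: " + sum);
        });
        worker.start();
        // The main (UI) thread is free here: it could repaint, show a
        // progress bar, or react to a cancel button while the worker runs.
        worker.join();                          // for this demo, just wait
        return outcome.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runLongTaskInBackground());
    }
}
```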
2. Resource Sharing¶
What it means: Threads within the same process can easily and automatically share data and resources with each other.
Detailed Explanation: By default, threads share the entire address space of their parent process—this includes the code, global variables, and heap memory. This is a built-in feature of the threading model.
In contrast, processes are isolated from each other by the operating system. For processes to share data, the programmer must use more complex and slower Inter-Process Communication (IPC) techniques, such as:
- Shared Memory: A region of memory is explicitly set up to be accessible by multiple processes.
- Message Passing: Data is explicitly copied from one process's address space to another's using mechanisms like pipes or message queues.
Threads avoid this overhead. This inherent sharing allows for efficient collaboration, such as multiple threads working on different parts of the same large data structure.
3. Economy¶
What it means: Threads are significantly cheaper to create and manage than processes.
Detailed Explanation: This point relates directly to performance and resource usage. Creating a process is a heavy-weight operation because the operating system must:
- Allocate a new and separate memory space for code, data, heap, and stack.
- Set up numerous kernel data structures to manage the process (e.g., Process Control Block, file descriptor table).
Creating a thread is a light-weight operation because the new thread:
- Shares the existing process's code, data, and files.
- Only requires the OS to allocate a small, new stack and a Thread Control Block (TCB) to manage its register set, PC, and state.
Furthermore, context switching between threads is faster than between processes. A thread context switch within the same process does not require invalidating the Translation Lookaside Buffer (TLB) because the memory address space does not change. A process context switch requires switching the memory address space, which often leads to a TLB flush, causing more cache misses and a higher performance penalty.
4. Scalability¶
What it means: Multithreaded applications can directly exploit the power of systems with multiple CPUs or multiple processor cores.
Detailed Explanation: This is a critical benefit for modern multicore systems, which you studied in computer architecture. A single-threaded process is inherently limited to running on a single processor core at any given moment, no matter how many idle cores are available.
A multithreaded application, however, can be designed so that its workload is divided among several threads. The operating system can then schedule these threads to run simultaneously on different cores. This allows for true parallel execution, dramatically reducing the total time required for CPU-intensive tasks like rendering a video, simulating a physical system, or querying a large database. The more cores available, the more a well-designed multithreaded application can scale its performance.
4.2 Multicore Programming¶
This section discusses the fundamental shift in computing from single-core to multi-core processors and how this changes the way we think about threads.
The Shift to Multicore Systems¶
Historical Context: The demand for more computing power first led to systems with multiple, separate CPU chips on a single motherboard. A more recent and prevalent trend is to place multiple independent processing units, called cores, onto a single chip. Each core is capable of executing its own stream of instructions. From the operating system's perspective, each core appears as a separate CPU.
The Role of Multithreading: Multithreaded programming is the primary mechanism for an application to take full advantage of these multiple cores, leading to more efficient resource use and true parallel execution.
Concurrency vs. Parallelism¶
This is a critical distinction that builds directly on your knowledge of CPU scheduling.
Concurrency:
- A system is concurrent if it supports more than one task making progress.
- It does not require multiple cores. On a single-core system, concurrency is achieved by the operating system's scheduler rapidly switching the single CPU between multiple threads (a context switch you're familiar with). Each thread gets a small time slice to run, creating the illusion of simultaneous execution.
- Refer to Figure 4.3: This diagram shows concurrent execution on a single-core system. Threads T1, T2, T3, and T4 are all making progress over time, but only one is actually executing on the CPU at any given instant. Their execution is interleaved.
Parallelism:
- A system exhibits parallelism if it can perform more than one task simultaneously.
- This requires a multicore (or multiprocessor) system. The OS can schedule different threads to run on different cores at the exact same time.
- Refer to Figure 4.4: This diagram (on the following page in your text) shows parallel execution on a multicore system. Threads T1, T2, T3, and T4 are each assigned to their own core and are executing simultaneously.
The Key Takeaway: You can have concurrency without parallelism, but you cannot have parallelism without a system that supports concurrent execution (like a multicore CPU).
Before multicore systems were common, single-CPU computers used concurrency (via time-sharing) to allow multiple applications to run "at the same time," even though they were never actually executing instructions in parallel. Multicore systems allow us to move from this interleaved concurrency to true parallelism.
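One way to see which situation your own machine is in (a small sketch; the output depends entirely on your hardware): ask the JVM how many cores the OS exposes. Parallelism is possible only when this number exceeds 1; concurrency works either way.

```java
public class CoreCount {
    static int availableCores() {
        // Each core appears to the OS (and the JVM) as a separate CPU.
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        int cores = availableCores();
        System.out.println("cores: " + cores);
        System.out.println(cores > 1
            ? "parallel execution of threads is possible"
            : "threads can only run concurrently (interleaved)");
    }
}
```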
4.2.1 Programming Challenges¶
While multicore systems offer the potential for massive performance gains, this potential comes with significant new challenges for programmers. Simply having multiple cores does not make an application faster; the application must be explicitly designed to leverage them. This shift is so fundamental that it is changing how software is designed and how computer science is taught.
Here are the five key challenges in programming for multicore systems:
1. Identifying Tasks¶
The Challenge: The programmer must analyze an application and decompose its workload into discrete units of work, or tasks, that can be executed concurrently.
Detailed Explanation: This is the first and most crucial step. You need to find parts of the program that can run at the same time. The ideal scenario is to identify tasks that are independent—meaning they do not need to communicate with or wait for each other. This allows them to run in parallel on different cores with minimal coordination overhead.
- Example: In a photo thumbnail application, creating a thumbnail for each image is a perfectly independent task. These tasks are easy to run in parallel.
2. Balance¶
The Challenge: After identifying tasks, the programmer must ensure that the work is distributed evenly across the tasks. This is also known as load balancing.
Detailed Explanation: If you spawn eight threads but one thread does 90% of the work while the other seven finish quickly and sit idle, you have gained very little from having eight cores. The goal is to divide the work so that all cores are kept busy for roughly the same amount of time. Furthermore, all tasks should be of roughly "equal value"; it's not efficient to dedicate a core to a trivial task whose setup and management costs more than just executing it on a single core.
3. Data Splitting¶
The Challenge: The data that the application operates on must be divided up to match the division of tasks.
Detailed Explanation: You can't just split the code; you must also split the data it processes. If multiple threads need to work on a large array, for example, you should split the array into chunks so that each thread can work on its own separate portion of the data. This minimizes conflicts and is the foundation for data parallelism.
Refer to Figure 4.5: This figure illustrates the two fundamental parallel programming models. In Data Parallelism, the same task (operation) is performed in parallel on different subsets of the same data. In Task Parallelism, different threads execute different tasks (functions) on potentially the same or different data.
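Data parallelism as defined above can be sketched directly: the same operation (summing) is applied to different contiguous chunks of one array, one thread per chunk. The chunking scheme here is invented for illustration; each thread writes only its own slot, so no synchronization on the partial results is needed.

```java
public class ParallelSum {
    // Split the array into nThreads contiguous chunks; each thread sums its
    // own chunk (data parallelism: same task, different subsets of the data).
    static long sum(int[] data, int nThreads) throws InterruptedException {
        long[] partial = new long[nThreads];
        Thread[] workers = new Thread[nThreads];
        int chunk = (data.length + nThreads - 1) / nThreads; // ceiling division
        for (int t = 0; t < nThreads; t++) {
            final int lo = t * chunk;
            final int hi = Math.min(data.length, lo + chunk);
            final int slot = t;
            workers[t] = new Thread(() -> {
                long s = 0;
                for (int i = lo; i < hi; i++) s += data[i];
                partial[slot] = s;             // no conflicts: each thread owns a slot
            });
            workers[t].start();
        }
        long total = 0;
        for (int t = 0; t < nThreads; t++) {
            workers[t].join();                 // wait, then combine the partial sums
            total += partial[t];
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] data = new int[100];
        for (int i = 0; i < 100; i++) data[i] = i + 1; // the values 1..100
        System.out.println(sum(data, 4)); // prints 5050
    }
}
```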
4. Data Dependency¶
The Challenge: Tasks often are not independent and must share data or communicate results. This creates dependencies where one task cannot proceed until another has finished a certain part of its work.
Detailed Explanation: This is one of the hardest parts of concurrent programming. If Task B requires a result computed by Task A, the program must be synchronized to ensure Task B does not read the data before Task A has finished writing it. Incorrect handling of data dependencies leads to severe bugs like race conditions, where the program's outcome depends on the unpredictable timing of the threads. Strategies to handle this (e.g., mutex locks, semaphores) are complex and will be covered in detail in Chapter 6.
5. Testing and Debugging¶
The Challenge: Testing and debugging multithreaded programs is inherently more difficult than for single-threaded programs.
Detailed Explanation: A single-threaded program has one, predictable path of execution. A multithreaded program has a vast number of possible execution paths because the operating system scheduler can interleave the threads in many different orders across the cores. A bug that only occurs under a very specific and rare timing of events may be nearly impossible to reproduce and find. Traditional debugging techniques are often inadequate for these types of concurrency bugs.
Amdahl's Law¶
Amdahl's Law is a fundamental principle in computer architecture and parallel computing that provides a realistic prediction of the maximum potential speedup a program can achieve when additional computing cores are added. It highlights a critical limitation: the serial (non-parallel) portion of a program ultimately limits its scalability.
The Formula and Its Components¶
The formula for Amdahl's Law is:
speedup ≤ 1 / (S + (1 - S)/N)
Where:
- speedup: The theoretical maximum speedup of the entire task.
- S: The serial fraction of the program. This is the portion that cannot be parallelized and must run on a single core.
- N: The number of processing cores.
- (1 - S): The parallel fraction of the program. This is the portion that can be divided and executed across multiple cores.
Detailed Explanation and Example¶
The formula calculates the total time the program would take on a multi-core system and compares it to the time on a single-core system. The new execution time is the sum of the serial part and the parallel part divided by N.
Let's use the example from your text:
- The application is 75% parallel, meaning the parallel fraction (1 - S) = 0.75.
- Therefore, the serial fraction S = 0.25 (or 25%).
With 2 processing cores (N=2): speedup ≤ 1 / (0.25 + (0.75)/2) = 1 / (0.25 + 0.375) = 1 / 0.625 = 1.6
This means the best speedup we can expect on two cores is 1.6 times faster than on a single core.
With 4 processing cores (N=4): speedup ≤ 1 / (0.25 + (0.75)/4) = 1 / (0.25 + 0.1875) = 1 / 0.4375 ≈ 2.28
Adding two more cores provides a performance increase, but the scaling is not linear: doubling the core count did not double the 1.6x speedup to 3.2x.
The Critical Implication: The Law of Diminishing Returns¶
The most important insight from Amdahl's Law is that as the number of cores (N) approaches infinity, the maximum possible speedup converges to 1/S.
For example, if S = 0.5 (50% of the program is serial), the maximum speedup is:
speedup ≤ 1 / 0.5 = 2
This means that no matter how many thousands of cores you add, if half of your program is inherently sequential, the absolute maximum speedup is a factor of two. This is why the serial portion has a "disproportionate effect": even a small serial component creates a hard ceiling on performance.
Refer to the Graph: The graph in your text visually demonstrates this principle.
- The Ideal Speedup line shows a perfect world where speedup is equal to the number of cores (N). This is impossible to achieve due to serial sections.
- The other lines show real-world scenarios with different serial fractions (S).
- You can see that as the number of cores increases, the speedup curves flatten out, approaching their maximum limit of 1/S. The higher the serial fraction (S), the lower the ceiling and the quicker the curve flattens.
In summary: Amdahl's Law teaches programmers that to achieve significant speedups on multicore systems, the primary goal must be to minimize the serial portion (S) of their code. Designing algorithms that are as parallel as possible is essential.
4.2.2 Types of Parallelism¶
To effectively design programs for multicore systems, we categorize parallel execution into two main strategies. These strategies determine how work is divided among threads and cores.
1. Data Parallelism¶
Definition: Data parallelism focuses on distributing subsets of the same data across multiple computing cores and having each core perform the same operation on its subset.
Detailed Explanation: In this model, the operation (the task) is fixed and singular. The key is to split the data into chunks that can be processed independently and simultaneously. This is often easier to implement when the operation on one piece of data does not depend on the result from another.
Example from the text: Summing the contents of a large array.
- On a single-core system, one thread iterates through all N elements.
- On a dual-core system:
  - Thread A (on Core 0) sums the first half of the array: elements [0] ... [N/2 - 1].
  - Thread B (on Core 1) sums the second half of the array: elements [N/2] ... [N - 1].
- After both threads finish, their partial sums are combined to get the final total.
The core operation (summation) is identical for both threads; only the data they work on is different.
2. Task Parallelism¶
Definition: Task parallelism involves distributing distinct tasks (threads) across multiple computing cores. Each thread can perform a unique operation, and they may operate on the same data, different data, or a combination.
Detailed Explanation: This model emphasizes the diversity of the work being done. Instead of repeating the same operation on different data, different cores are executing entirely different functions or parts of the program's logic concurrently.
Example from the text: Performing statistical operations on an array.
- Thread A (on Core 0) could be calculating the average of the array.
- Thread B (on Core 1) could simultaneously be finding the maximum and minimum values in the same array.
Here, the threads are performing different tasks (calculate average vs. find min/max). They might both be reading from the same array, but they are executing different code.
Refer to Figure 4.5: This figure provides a clear visual distinction.
- Data Parallelism is shown with the same task (same shape) being applied to different blocks of data on each core.
- Task Parallelism is shown with different tasks (different shapes) being executed on each core.
Hybrid Approach¶
The text makes a crucial point: data and task parallelism are not mutually exclusive. A complex application will often use a hybrid strategy.
Example: A video processing application might use:
- Task Parallelism: One set of threads handles decoding the video, another set handles processing the audio, and a third handles applying visual effects.
- Data Parallelism within a Task: The "visual effects" task could itself use data parallelism, where multiple threads work on different portions of the same video frame to apply a filter more quickly.
Understanding these two types allows a programmer to choose the right strategy—or combination of strategies—to most effectively decompose a problem for parallel execution.
4.3 Multithreading Models¶
This section explains the crucial relationship between the threads an application uses and how the operating system sees them. This relationship is fundamental to how your multithreaded program actually runs on the hardware.
User Threads vs. Kernel Threads¶
First, we must define two key concepts:
- User Threads: These are threads that are managed entirely by a user-level thread library (like Java's threads or POSIX Pthreads) without direct support from the operating system kernel. The OS is completely unaware of these threads; it only sees the process as a single unit.
- Kernel Threads: These are threads that are supported and managed directly by the operating system kernel. The kernel schedules kernel threads, not processes.
Refer to Figure 4.6: This figure illustrates the fundamental relationship. The user threads exist in the application's user space, while the kernel threads are managed within the OS kernel space. The "Multithreading Model" is the mapping that connects these two layers.
4.3.1 Many-to-One Model¶
Definition: The Many-to-One model maps many user-level threads to a single kernel thread.
Detailed Explanation: In this model, thread creation, scheduling, and management (all the work of context switching between threads) are done by a thread library in the user space. This is very efficient because it doesn't require making a system call (a request to the kernel) for every thread operation like a context switch.
However, this model has two major, critical disadvantages:
Blocking System Calls: Since the OS kernel only sees one kernel thread (the entire process), if one user thread makes a blocking system call (like reading from a slow disk or network), the single kernel thread blocks. This causes the entire process to block, freezing all other user threads inside that process, even if they have work they could be doing.
No True Parallelism: Because there is only one kernel thread associated with the process, the OS can only schedule that one thread on one CPU core. Even on a multicore system, the multiple user threads within the process cannot run in parallel on different cores; they can only run concurrently through interleaving on a single core.
Refer to Figure 4.7: This diagram shows the many-to-one model. Notice how multiple user threads in the user space all funnel down through a single, central connection to one kernel thread in the kernel space.
Historical Context and Modern Relevance: The text mentions Green threads, which used this model and were part of early Java versions. Because of the severe limitations listed above, the many-to-one model has been largely abandoned in modern operating systems and programming languages. Its inability to leverage multiple cores makes it unsuitable for contemporary hardware.
4.3.2 One-to-One Model¶
Definition: The One-to-One model maps each user-level thread to its own dedicated kernel thread.
Detailed Explanation¶
This model creates a direct, one-to-one correspondence between the threads your application uses (user threads) and the entities the operating system scheduler manages (kernel threads).
Advantages:
True Parallelism and Concurrency: This is the primary advantage. Because each user thread is backed by its own kernel thread, the operating system scheduler can assign different threads to different CPU cores. This allows a multithreaded process to achieve true parallel execution on a multicore system.
No Blocking on System Calls: If one user thread makes a blocking system call (e.g., waiting for I/O), only that one kernel thread blocks. The OS scheduler can immediately switch to another, ready-to-run kernel thread within the same process (or a different process). This keeps the CPU busy and allows other threads in the application to continue making progress.
Disadvantage:
- Overhead and Scalability Limitation: The major drawback is overhead. Creating a user thread necessitates creating a corresponding kernel thread. Kernel thread creation is a heavyweight operation that consumes significant kernel resources and time. Furthermore, each kernel thread requires memory for its kernel-level data structures (like a Thread Control Block). As a result, there is often a hard limit on the total number of threads a system can support, and creating a very large number of threads (thousands) can severely burden the system and degrade overall performance.
Refer to Figure 4.8: This diagram illustrates the one-to-one model. You can see a direct, one-to-one mapping between each user thread in user space and each kernel thread in kernel space.
Modern Usage: This is the dominant model used in modern major operating systems. Linux, Windows, and macOS all implement the one-to-one model. This is why these systems can effectively leverage multicore processors for multithreaded applications, despite the higher per-thread cost.
4.3.3 Many-to-Many Model¶
Definition: The Many-to-Many model multiplexes (or maps) many user-level threads to a smaller or equal number of kernel threads.
Detailed Explanation¶
This model attempts to combine the best features of the previous two models while eliminating their main drawbacks.
How it Works: A thread library in user space manages a pool of user threads. The application can create a very large number of these user threads. This pool is then executed by a smaller set of kernel threads, the number of which is controlled by the OS or the thread library, often based on the number of available CPU cores.
Advantages:
Concurrency and Parallelism: Unlike the many-to-one model, the presence of multiple kernel threads allows the OS to schedule them on multiple cores, enabling true parallelism. If one kernel thread blocks on a system call, the OS can immediately schedule another kernel thread from the same process.
Efficiency and Scalability: Unlike the one-to-one model, the application is not limited by the high cost of kernel threads. A programmer can create thousands of user threads without overburdening the OS, as only a manageable number of kernel threads are ever in existence. This provides great flexibility.
Refer to Figure 4.9: This diagram shows the many-to-many model. Notice how a large number of user threads are mapped to a smaller pool of kernel threads.
The Two-Level Model¶
Definition: The Two-Level model is a variation of the many-to-many model. It multiplexes many user threads to a pool of kernel threads, but it also allows a certain number of user threads to be permanently bound to a dedicated kernel thread.
Detailed Explanation: This model adds an extra layer of flexibility. It allows the thread library to treat some user threads as "special" by giving them a direct, one-to-one connection to a kernel thread (like in the one-to-one model). This is useful for threads that are particularly important or perform a lot of blocking I/O, ensuring they get scheduled immediately by the kernel. The rest of the user threads share the remaining kernel threads in the pool (like in the many-to-many model).
Refer to Figure 4.10: This diagram illustrates the two-level model. You can see that most user threads are multiplexed, but one user thread has a direct, bound connection to its own kernel thread.
Modern Context and Implementation¶
The text makes two critical points about the modern relevance of these models:
Implementation Difficulty: The many-to-many and two-level models are complex to implement correctly in the operating system kernel.
Shift to the One-to-One Model: With the proliferation of multicore systems, the primary advantage of the many-to-many model—limiting kernel threads to save resources—has become less important. The hardware can now support many kernel threads running in parallel. Consequently, most modern general-purpose operating systems (Windows, Linux, macOS) have standardized on the simpler and high-performance one-to-one model.
However, the concept of the many-to-many model is experiencing a resurgence at the library level. Modern concurrency frameworks (like Java's ExecutorService or the Grand Central Dispatch system in Apple's ecosystems) allow programmers to define many "tasks." The library then manages a pool of worker threads (the kernel threads in a one-to-one system) to execute these tasks, effectively implementing the many-to-many model's logic in user space.
4.4 Thread Libraries¶
A thread library is an Application Programming Interface (API) that provides programmers with a set of functions to create, manage, and synchronize threads.
Implementation Methods¶
There are two primary ways to implement a thread library, which correspond directly to the user and kernel thread concepts from the previous section:
User-level Thread Library:
- The entire library—all code and data structures—exists in user space.
- How it works: When your program calls a thread function (like pthread_create()), it is a local function call within your process. It does not require a system call to the kernel.
- Implication: This is very fast and efficient, but it typically relies on the many-to-one model, meaning a blocking call by one thread will block the entire process.
Kernel-level Thread Library:
- The library is implemented and supported directly by the operating system kernel.
- How it works: When your program calls a thread function, it triggers a system call into the kernel.
- Implication: This has more overhead than a user-level call but allows the OS to manage the threads directly, enabling true parallelism and preventing one thread from blocking the entire process. This is the model used by modern one-to-one systems.
Major Thread Libraries in Use¶
The text introduces three main thread libraries:
- POSIX Pthreads: A standardized threading interface for UNIX-like systems (Linux, macOS). It can be implemented as either a user-level or a kernel-level library.
- Windows Threads: The kernel-level threading API for the Windows operating system.
- Java Threads: The threading API provided by the Java programming language. Since the Java Virtual Machine (JVM) runs on a host OS, Java threads are ultimately implemented using the native thread library of that host system (e.g., Pthreads on Linux, Windows threads on Windows).
A Key Difference in Data Sharing:
- In Pthreads and Windows, any variable declared globally (outside of any function) is automatically shared by all threads in the process.
- In Java, there is no direct concept of global data. Sharing data between threads must be done explicitly, typically by having threads reference the same shared object.
Thread Creation Strategies: Asynchronous vs. Synchronous¶
Before looking at code examples, it's important to understand the two general strategies for how a parent thread manages its child threads.
1. Asynchronous Threading
- Behavior: The parent thread creates a child thread and then immediately continues its own execution. The parent and child run concurrently and independently.
- Data Sharing: There is typically little to no data sharing between the threads.
- Use Cases:
- The multithreaded server from Figure 4.2, where the main thread immediately goes back to listening for new requests.
- Responsive user interfaces, where a background thread handles a long task while the UI thread remains interactive.
2. Synchronous Threading
- Behavior: The parent thread creates one or more children and then must wait for all of them to finish their work and terminate before it can resume its own execution. This waiting for a thread to finish is often called joining.
- Data Sharing: This model often involves significant data sharing. The parent thread frequently needs to collect and combine the results computed by its children.
- Use Case: The upcoming summation example uses this strategy. The main thread will create a worker thread to calculate the sum, wait for it to finish, and then use the result.
The text states that all the following code examples for Pthreads, Windows, and Java will use the synchronous threading strategy.
4.4.1 Pthreads¶
Pthreads is a standardized specification (IEEE 1003.1c) for an API to create and manage threads. It is a specification, not an implementation, meaning different operating systems can implement it as they see fit, but they must provide the same core functions. It is commonly found on UNIX-type systems like Linux and macOS.
Program Walkthrough¶
1. Includes and Global Data
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
int sum; /* this data is shared by the thread(s) */
void *runner(void *param); /* threads call this function */
- pthread.h is the essential header for all Pthreads functions and data types.
- int sum; is a global variable. In Pthreads (and C), global data is automatically shared by all threads in the process. This is the variable where the child thread will store its result.
2. Main Function - The Parent Thread
int main(int argc, char *argv[])
{
pthread_t tid; /* the thread identifier */
pthread_attr_t attr; /* set of thread attributes */
- pthread_t tid; declares a variable to hold the unique thread ID.
- pthread_attr_t attr; declares a structure to hold thread attributes (such as stack size and scheduling policy). We will use the defaults.
3. Initialization and Thread Creation
pthread_attr_init(&attr); /* set the default attributes */
pthread_create(&tid, &attr, runner, argv[1]); /* create the thread */
- pthread_attr_init(&attr) initializes the attribute structure with default values.
- pthread_create() is the crucial call that creates the new thread. It takes four arguments:
  - &tid: A pointer to the thread ID variable, which will be filled in by the system.
  - &attr: A pointer to the thread attributes.
  - runner: The name of the function where the new thread will start execution.
  - argv[1]: The argument to pass to the runner function. In this case, it's the command-line argument (the number to sum up to).
At this point, two threads are running concurrently:
- The main/parent thread continues executing in the main() function.
- The child thread begins executing in the runner() function.
4. Synchronous Join
pthread_join(tid, NULL); /* wait for the thread to exit */
printf("sum = %d\n", sum);
- pthread_join(tid, NULL) is how the parent thread performs a synchronous wait. The main thread will block here until the child thread with ID tid terminates.
- This ensures that the printf statement does not execute until after the child thread has finished calculating the sum.
5. The Runner Function - The Child Thread
void *runner(void *param)
{
int i, upper = atoi(param);
sum = 0;
for (i = 1; i <= upper; i++)
sum += i;
pthread_exit(0);
}
- This is the function the child thread executes. Its signature must be void *func_name(void *param).
- atoi(param) converts the command-line string argument (argv[1]) passed from main() into an integer.
- The for loop performs the summation and stores the result in the shared global variable sum.
- pthread_exit(0) is how a thread explicitly terminates itself. The 0 is an exit status that could be retrieved by pthread_join().
Joining Multiple Threads¶
The text in Figure 4.12 shows a common pattern for when a program creates many threads. You can store their thread IDs in an array and then use a simple loop to join all of them.
#define NUM_THREADS 10
pthread_t workers[NUM_THREADS]; /* an array of thread IDs */
/* ... Code to create 10 threads using pthread_create ... */
/* Wait for all 10 threads to finish */
for (int i = 0; i < NUM_THREADS; i++)
pthread_join(workers[i], NULL);
This loop ensures the parent thread waits for all ten child threads to complete before it continues, which is essential for collecting all their results or performing a final aggregation.
4.4.2 Windows Threads¶
The Windows thread library is a kernel-level API for creating and managing threads on the Windows operating system. The program in Figure 4.13 demonstrates its use, and its structure is very similar to the Pthreads example, using synchronous threading.
Program Walkthrough¶
1. Includes and Global Data
#include <windows.h>
#include <stdio.h>
DWORD Sum; /* data is shared by the thread(s) */
- windows.h is the essential header file for the Windows API.
- DWORD Sum; is a global variable. DWORD is a Windows-specific data type for a 32-bit unsigned integer. As in Pthreads, global data is automatically shared by all threads.
2. The Thread Function
DWORD WINAPI Summation(LPVOID Param)
{
DWORD Upper = *(DWORD*)Param;
for (DWORD i = 1; i <= Upper; i++)
Sum += i;
return 0;
}
- The function that the new thread will run must have this specific signature: DWORD WINAPI func_name(LPVOID Param).
  - DWORD is the return type.
  - WINAPI is a calling convention specific to the Windows API.
  - LPVOID Param is a pointer to void, allowing any data type to be passed to the thread.
- DWORD Upper = *(DWORD*)Param; casts the LPVOID parameter back into a pointer to a DWORD and then dereferences it to get the integer value.
- The function calculates the sum and stores it in the shared global variable Sum.
- It terminates by return 0;, which is equivalent to pthread_exit(0).
3. Main Function - The Parent Thread
int main(int argc, char *argv[])
{
DWORD ThreadId;
HANDLE ThreadHandle;
int Param;
- DWORD ThreadId; will store the unique ID assigned to the new thread by the system.
- HANDLE ThreadHandle; introduces a crucial concept: a HANDLE is an abstract reference to a kernel object (in this case, the thread). We use the handle to manage the thread.
- int Param; is a local variable to hold the command-line argument converted to an integer.
4. Thread Creation
Param = atoi(argv[1]);
ThreadHandle = CreateThread(
NULL, /* default security attributes */
0, /* default stack size */
Summation, /* thread function */
&Param, /* parameter to thread function */
0, /* default creation flags */
&ThreadId); /* returns the thread identifier */
- CreateThread() is the function that creates the new thread. Its parameters are:
  - Security Attributes: NULL for default. Controls whether child processes can inherit this handle.
  - Stack Size: 0 for the default stack size (typically 1 MB).
  - Start Function: Summation, the function the thread will execute.
  - Parameter: &Param, a pointer to the data we want to pass to the Summation function.
  - Creation Flags: 0 means the thread runs immediately after creation. CREATE_SUSPENDED would create it in a paused state.
  - Thread ID: &ThreadId, a pointer to where the system should store the new thread's ID.
5. Synchronous Join
WaitForSingleObject(ThreadHandle, INFINITE);
CloseHandle(ThreadHandle);
printf("sum = %d\n", Sum);
- WaitForSingleObject(ThreadHandle, INFINITE) is the Windows equivalent of pthread_join().
  - It makes the parent thread block (wait) until the kernel object referred to by ThreadHandle (the child thread) enters a "signaled" state, which happens when the thread terminates.
  - INFINITE means the parent will wait forever if necessary.
- CloseHandle(ThreadHandle) is critical for resource management. Once you are done with a kernel object, you must close its handle to free system resources. The thread itself continues to run if it hasn't already finished.
Waiting for Multiple Threads¶
The text explains that to wait for multiple threads, you use WaitForMultipleObjects().
Example:
// Assume THandles is an array of HANDLEs of size N
WaitForMultipleObjects(
N, // Number of objects (threads) to wait for
THandles, // Array of HANDLEs
TRUE, // Wait for ALL objects to be signaled (FALSE = wait for ANY)
INFINITE // Timeout duration (INFINITE to wait forever)
);
This function allows the parent thread to wait for an entire pool of worker threads to finish their work before proceeding, which is essential for synchronous execution patterns.
4.4.3 Java Threads¶
Threads are a fundamental part of the Java language, and the Java API provides extensive features for creating and managing them. Every Java program, even a simple one with just a main() method, runs as a single thread within the Java Virtual Machine (JVM). The JVM then uses the host operating system's native thread library (e.g., Pthreads on Linux, Windows threads on Windows) to implement these Java threads.
Two Techniques for Creating Threads¶
Java provides two primary ways to create a thread. The more modern and commonly used technique is to implement the Runnable interface.
1. Implement the Runnable Interface (Common Practice)
This involves creating a class that implements the Runnable interface, which requires defining a single method: public void run().
class Task implements Runnable {
public void run() {
System.out.println("I am a thread.");
}
}
The code inside the run() method is the task that will execute in a separate thread.
2. Extend the Thread Class (Less Common)
The alternative is to create a new class that is derived from the Thread class and override its run() method.
Thread Creation and Execution¶
Once you have a Runnable object, you create and start a thread like this:
Thread worker = new Thread(new Task()); // Create a Thread object, passing a Runnable
worker.start(); // Start the thread
The start() method is crucial. It does two things:
- It allocates memory and initializes a new thread within the JVM.
- It automatically calls the run() method on your Runnable object, making the thread eligible to be scheduled by the JVM.
Important: You should never call the run() method directly. Calling run() directly would simply execute the method in the current thread, like a normal function call. You must use start() to launch a new thread of execution.
Modern Syntax: Lambda Expressions¶
Starting with Java 8, Lambda expressions provide a much more concise way to create threads, eliminating the need for a separate class. A Lambda expression effectively defines an anonymous Runnable.
Runnable task = () -> {
System.out.println("I am a thread.");
};
Thread worker = new Thread(task);
worker.start();
This is functionally identical to the first example but is written in a cleaner, more compact style. Lambda expressions are a key feature for writing modern, readable concurrent code in Java.
Synchronous Joining in Java¶
To implement the synchronous threading strategy (where the parent thread waits for the child to finish), Java provides the join() method.
try {
worker.join(); // Main thread waits here for 'worker' thread to finish
}
catch (InterruptedException ie) { }
- The join() method causes the current thread (e.g., the main thread) to block until the thread it is called on (worker) terminates.
- It can throw an InterruptedException if the waiting thread is interrupted, which must be caught or declared.
Waiting for Multiple Threads: Just like in Pthreads, you can wait for multiple threads by storing their references and joining them in a loop.
Thread[] workers = new Thread[10];
// ... Code to create and start 10 threads ...
// Wait for all 10 threads to finish
for (int i = 0; i < workers.length; i++) {
try {
workers[i].join();
} catch (InterruptedException ie) { }
}
This pattern ensures the parent thread does not proceed until all child threads have completed their work, which is essential for collecting results.
4.4.3.1 Java Executor Framework¶
The basic Thread and Runnable approach has been in Java since the beginning. However, Java 5 introduced a more powerful and flexible framework for concurrency in the java.util.concurrent package. The Executor Framework provides better control over thread creation, execution, and communication.
Core Concept: Separating Task from Execution¶
The fundamental idea is to separate what needs to be done (the task) from how it is run (the mechanics of thread management).
The Executor Interface:
At the heart of the framework is the Executor interface:
public interface Executor {
void execute(Runnable command);
}
Instead of creating a Thread object and calling start(), you now create an Executor and call execute() with your Runnable task.
Executor service = new SomeExecutorImplementation();
service.execute(new Task()); // Task is a class implementing Runnable
This design is based on the producer-consumer model: your application produces tasks (Runnable objects), and the threads managed by the Executor consume and execute them.
Solving the Data Sharing Problem: Callable and Future¶
In Pthreads and Windows, sharing results is easy with global variables. Java, being a pure object-oriented language, has no global data. The basic Runnable interface's run() method also cannot return a result.
The Executor Framework solves this with two key interfaces:
- Callable<V> Interface:
  - Similar to Runnable, but its call() method can return a value and throw checked exceptions.
  - V is the type of the result returned by the task.
- Future<V> Interface:
  - Represents the result of an asynchronous computation (a task that is running in another thread).
  - You use a Future object to retrieve the result after the task has completed.
Program Walkthrough (Figure 4.14)¶
The provided Java program demonstrates the Executor Framework with Callable and Future.
1. The Summation Task (Callable)
class Summation implements Callable<Integer> {
private int upper;
public Summation(int upper) {
this.upper = upper;
}
public Integer call() {
int sum = 0;
for (int i = 1; i <= upper; i++)
sum += i;
return sum; // the key point: call() returns a result (autoboxed to Integer)
}
}
- The Summation class implements Callable<Integer>, meaning its call() method will return an Integer.
- The upper bound for the summation is passed via the constructor.
- The call() method performs the calculation and returns the result.
2. The Main Driver (Using the Executor)
public class Driver {
public static void main(String[] args) {
int upper = Integer.parseInt(args[0]);
// Step 1: Create an ExecutorService (a thread pool with one thread)
ExecutorService pool = Executors.newSingleThreadExecutor();
// Step 2: Submit the Callable task to the executor. This returns a Future.
Future<Integer> result = pool.submit(new Summation(upper));
try {
// Step 3: Retrieve the result using Future.get()
System.out.println("sum = " + result.get());
} catch (InterruptedException | ExecutionException ie) {
ie.printStackTrace(); // don't silently swallow interruption or task failure
}
// Shut down the executor so its worker thread exits and the JVM can terminate.
pool.shutdown();
}
}
- Executors.newSingleThreadExecutor() creates an ExecutorService (a type of Executor) that uses a single worker thread.
- pool.submit() is used to submit a Callable task. It immediately returns a Future<Integer> object: a promise of a future result, while the calculation runs in the background.
- result.get() is the equivalent of join(), but with a crucial difference: it retrieves the result. This method blocks the main thread until the computation is complete and then returns the Integer result produced by the call() method.
Benefits of the Executor Framework¶
While it seems more complex than new Thread().start(), the Executor Framework provides significant advantages:
- Returning Results: Callable and Future allow threads to return values, which is not possible with basic Runnable.
- Decoupling: It cleanly separates task submission from task execution. You focus on defining the task, and the framework handles the threading details.
- Efficiency and Control: In Section 4.5.1, you will see that ExecutorService can manage a pool of reusable threads, which is far more efficient than creating and destroying a new thread for every single task. This is essential for managing a large number of tasks.
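To see all three benefits together, here is a minimal end-to-end sketch (my own example, not from the text): several Callable tasks are submitted to a pool, and their results are collected through Future objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FutureDemo {
    // Submits one Callable per value and combines the results via Future.get().
    static int parallelSquareSum(int n) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int i = 1; i <= n; i++) {
                final int value = i;
                // The lambda is a Callable<Integer>: it returns a result.
                futures.add(pool.submit(() -> value * value));
            }
            int total = 0;
            for (Future<Integer> f : futures)
                total += f.get();   // blocks until that task's result is ready
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();        // let the worker threads exit
        }
    }

    public static void main(String[] args) {
        System.out.println("sum of squares = " + parallelSquareSum(5)); // 55
    }
}
```

Note how no Thread objects appear anywhere: the pool decides which worker runs each task.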
4.5 Implicit Threading¶
As multicore processors become standard, applications are being designed with hundreds or even thousands of threads. Managing this level of concurrency explicitly is very difficult and error-prone, especially concerning the challenges of correctness (like race conditions and deadlocks, covered in Chapters 6 and 8).
The Core Idea of Implicit Threading¶
Implicit threading is a strategy to address this complexity. The core idea is to transfer the responsibility for creating and managing threads from the application developer to the compiler and run-time libraries.
How it Works:
Instead of the programmer explicitly creating Thread objects and managing their lifecycle, they simply identify units of work, or tasks, that can run in parallel. These tasks are typically written as simple functions. A run-time library then takes these tasks and automatically maps them to a pool of threads, handling all the complex details of thread creation, scheduling, synchronization, and destruction.
The Advantage: The programmer only needs to think about what can be done in parallel (the tasks). The system handles how to execute them in parallel (the threading). This significantly reduces the potential for bugs and simplifies development.
The JVM and the Host Operating System¶
This note clarifies how Java threads, which we've been discussing, are actually implemented, linking back to the multithreading models from Section 4.3.
- The Java Virtual Machine (JVM) runs as an application on top of a host operating system (like Windows, Linux, or macOS).
- The JVM specification is abstract; it does not dictate how Java threads must be mapped to the OS. This decision is left to the specific JVM implementation.
- On operating systems like Windows, Linux, and macOS that use the one-to-one model, each Java thread is typically mapped directly to a kernel thread.
- Furthermore, the JVM implementation will use the host OS's native thread library to create these kernel threads (e.g., the Windows API on Windows, Pthreads on Linux/macOS).
This abstraction is what allows Java to be "write once, run anywhere." The same Java thread code works correctly regardless of the underlying OS threading implementation.
The Shift in Programming Paradigm¶
Implicit threading represents a fundamental shift:
- Explicit Threading (Old Way): The programmer manages Thread objects, start(), join(), etc.
- Implicit Threading (New Way): The programmer defines task objects or functions. A library (like the Java Executor Framework) manages everything else.
The text states that implicit threading libraries typically use a threading model similar to the many-to-many model, where a pool of kernel threads is used to execute a potentially larger number of logical tasks defined by the programmer.
The following sections will explore four specific approaches to implicit threading.
4.5.1 Thread Pools¶
A thread pool is a classic and powerful implementation of implicit threading. It directly addresses the performance and resource management problems of the "thread-per-request" model.
The Problem with Unlimited Thread Creation¶
Recall the multithreaded server from Section 4.1. While better than creating a process for each request, creating a new thread for every single request has two major drawbacks:
- Performance Overhead: The time it takes to create and later destroy a thread is significant. For a high-traffic server, this constant creation and destruction of threads wastes CPU cycles and memory.
- Resource Exhaustion: If the server receives a massive number of concurrent requests, it will create a correspondingly massive number of threads. This can exhaust system resources like CPU time and memory, potentially crashing the server.
How a Thread Pool Works¶
The thread pool pattern provides a solution:
- At startup, a fixed number of threads are created and placed into a "pool," where they sit idle, waiting for work.
- When a request arrives, it is submitted to the pool as a task (a Runnable or Callable in Java). The server does not create a new thread.
- An available thread from the pool is awakened and assigned the task. If no threads are available, the task waits in a queue until one becomes free.
- When a thread finishes its task, it does not terminate. Instead, it returns to the pool, ready to accept the next task.
Refer to the Android Note: The text highlights that Android's RPC system uses a thread pool for its remote services, allowing them to handle multiple concurrent client requests efficiently.
Benefits of Thread Pools¶
- Performance: Using an existing thread from a pool is much faster than creating a new thread from scratch.
- Resource Control: The pool places a strict bound on the number of concurrent threads, preventing the system from being overwhelmed. This is crucial for system stability.
- Flexibility: It separates the task from the execution mechanics. This allows for advanced features like scheduling tasks for delayed or periodic execution.
Tuning the Pool: The number of threads in the pool can be set based on system factors (number of CPUs, memory) and expected load. More sophisticated pools can even resize themselves dynamically, using more threads under high load and fewer under low load to save resources.
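Java exposes these tuning knobs directly through ThreadPoolExecutor, the class behind the Executors factory methods. This is a hedged sketch with illustrative sizes, not a recommended configuration:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class TunedPool {
    // Runs n trivial tasks on a hand-tuned pool and reports how many completed.
    static long runTasks(int n) {
        // Core of 2 threads; up to 4 allowed, with idle non-core threads
        // reclaimed after 60 seconds. Subtlety: with an unbounded work queue,
        // the pool never actually grows past its core size; the maximum only
        // matters when the queue is bounded and fills up.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 4, 60, TimeUnit.SECONDS,
                new LinkedBlockingQueue<Runnable>());
        for (int i = 0; i < n; i++) {
            final int id = i;
            pool.execute(() -> System.out.println("task " + id + " on "
                    + Thread.currentThread().getName()));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return pool.getCompletedTaskCount();
    }

    public static void main(String[] args) {
        System.out.println("completed = " + runTasks(10));
    }
}
```

Dynamically resizing pools, as described above, are exactly what this core/maximum/keep-alive triple implements.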
Thread Pools in Practice: Windows API Example¶
The text provides an example using the Windows API to show how a programmer interacts with a thread pool.
1. Define the Task Function: This function is the code that will run in a pooled thread.
DWORD WINAPI PoolFunction(PVOID Param) {
/* this function runs as a separate thread from the pool. */
return 0;
}
2. Submit the Task to the Pool:
Instead of using CreateThread(), the program uses QueueUserWorkItem() to submit the task.
QueueUserWorkItem(&PoolFunction, NULL, 0);
This function's parameters are:
- &PoolFunction: A pointer to the function to execute.
- NULL: The parameter to pass to the function (none in this case).
- 0: Flags for special instructions (none in this case).
The Windows thread pool API is extensive, also providing utilities to run functions after a timer expires or when an I/O operation completes. This demonstrates the power and flexibility of letting a library handle the threading details.
4.5.1.1 Java Thread Pools¶
The Java Executor Framework, introduced earlier, is the primary mechanism for creating and managing thread pools. It provides several pre-configured pool architectures through factory methods in the Executors class.
Three Common Thread Pool Models¶
Java's java.util.concurrent package offers several types of thread pools. The text focuses on three core models:
1. Single Thread Executor (newSingleThreadExecutor()):
- Creates a pool with exactly one thread.
- All submitted tasks are executed sequentially by this single thread.
- Useful for tasks that need to be processed in a strict order without concurrency.
2. Fixed Thread Executor (newFixedThreadPool(int size)):
- Creates a pool with a fixed number of threads.
- This is the most common pool for controlling resource usage. The number of concurrent tasks is strictly limited to the pool size; extra tasks wait in a queue.
- Ideal for known, sustained workloads (e.g., a server with a predictable number of concurrent requests).
3. Cached Thread Executor (newCachedThreadPool()):
- Creates an "unbounded" pool that can create new threads as needed but reuses idle existing threads.
- If a thread is idle for 60 seconds, it is terminated and removed from the pool.
- This pool improves performance for many short-lived asynchronous tasks but is dangerous for unbounded workloads, as it could create thousands of threads.
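The fixed pool's strict bound can be made observable. In this small experiment (my own sketch, not from the text), an AtomicInteger records how many tasks are running at once; with newFixedThreadPool(2), the peak never exceeds 2:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PoolBoundDemo {
    // Returns the highest number of tasks ever observed running concurrently.
    static int maxObservedConcurrency(int poolSize, int tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                int now = running.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);  // record high-water mark
                try {
                    Thread.sleep(20);                   // pretend to do work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                running.decrementAndGet();
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("peak concurrency = " + maxObservedConcurrency(2, 10));
    }
}
```

The same experiment run against newCachedThreadPool() would typically show a much higher peak, since that pool grows on demand.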
You have already seen newSingleThreadExecutor() in Figure 4.14.
Creating and Using a Thread Pool (Figure 4.15)¶
The Java program in Figure 4.15 demonstrates the standard pattern for using a thread pool.
1. Create the Thread Pool:
ExecutorService pool = Executors.newCachedThreadPool();
This line uses a factory method to create and return an ExecutorService instance—in this case, a cached thread pool.
2. Submit Tasks to the Pool:
for (int i = 0; i < numTasks; i++)
pool.execute(new Task()); // 'Task' is a class implementing Runnable
- The execute(Runnable command) method is used to submit tasks to the pool.
- The pool manages the internal queue and assigns each task to an available thread from the pool. The programmer does not create any Thread objects.
3. Shut Down the Pool:
pool.shutdown();
- The shutdown() method is crucial for graceful termination.
- It tells the executor to stop accepting new tasks.
- The pool will then shut down after all previously submitted tasks have finished executing.
- (There is also a shutdownNow() method, which attempts to stop all actively executing tasks.)
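The difference between the two shutdown styles can be demonstrated. In this sketch (my own example, not from the text), one long task occupies the single worker while more tasks wait in the queue; shutdownNow() interrupts the running task and hands back the tasks that never started:

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ShutdownDemo {
    // Occupies the single worker thread with one long task, queues n more,
    // then calls shutdownNow(). Returns how many queued tasks never ran.
    static int cancelPending(int n) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        pool.execute(() -> {
            started.countDown();                      // the worker is busy now
            try {
                Thread.sleep(1000);                   // simulate long-running work
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();   // shutdownNow() interrupts us
            }
        });
        for (int i = 0; i < n; i++)
            pool.execute(() -> { });                  // these wait in the queue
        try {
            started.await();                          // first task is definitely running
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        List<Runnable> pending = pool.shutdownNow();  // never-started tasks come back
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return pending.size();                        // all n queued tasks were cancelled
    }

    public static void main(String[] args) {
        System.out.println("tasks cancelled before starting: " + cancelPending(5));
    }
}
```

With plain shutdown(), by contrast, every queued task would have run to completion before the pool terminated.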
Key Concepts Recap¶
- ExecutorService Interface: This is the core interface you work with. It extends the basic Executor interface and adds life-cycle management methods like submit(), shutdown(), and awaitTermination().
- Implicit Threading in Action: This is a perfect example of implicit threading. The programmer defines task objects (the what) and submits them to the ExecutorService. The framework handles all the details of thread creation, lifecycle, and assignment (the how). This makes the application more robust, efficient, and easier to write correctly.
4.5.2 Fork-Join¶
The fork-join paradigm is a specific strategy for parallel programming that is exceptionally well-suited for implicit threading.
Recap: The Explicit Fork-Join Model¶
As covered in Section 4.4, the basic idea is:
- Fork: A parent thread creates (forks) one or more child threads to perform subtasks.
- Work: The parent and children work concurrently.
- Join: The parent waits (joins) for all child threads to finish.
- Combine: The parent combines the results from the children.
This is the synchronous threading strategy we saw in the Pthreads, Windows, and basic Java examples.
The Implicit Fork-Join Model¶
The implicit threading approach takes this same logical pattern but removes the burden of thread management from the programmer.
How it Works:
- Instead of explicitly creating Thread objects, the programmer only identifies parallel tasks that can be forked.
- A sophisticated run-time library is responsible for:
- Creating and managing the actual threads.
- Deciding how many threads to create (e.g., based on the number of CPU cores).
- Assigning the forked tasks to these threads.
- Handling the joining and result combination.
Refer to Figure 4.16: This figure illustrates the implicit model. The "Fork" step shows tasks being designated, not threads being created. A central "Library" component then maps these tasks to a pool of worker threads.
Relationship to Thread Pools¶
The text describes the implicit fork-join model as a "synchronous version of thread pools."
- Standard Thread Pool (Asynchronous): You submit independent tasks (like in the web server), and you don't necessarily wait for them to finish in a specific order.
- Fork-Join Pool (Synchronous): The pool is used to execute tasks that have explicit dependencies—a parent task cannot finish until its children tasks have completed and returned their results. The library manages the complex scheduling required to efficiently execute these interdependent tasks.
The library uses smart heuristics (like the number of available cores) to determine the optimal number of threads, ensuring efficient use of system resources without over-subscription.
In summary, the fork-join pattern is a powerful parallel algorithm design. Using an implicit threading library to implement it allows developers to express the parallel logic clearly while the library handles the complex and error-prone details of thread synchronization and workload balancing. This is especially effective for problems that can be broken down recursively, like sorting or tree traversal.
4.5.2.1 Fork Join in Java¶
What is the Fork-Join Framework?¶
The fork-join framework is a Java library (added in Java 7) designed specifically for a common type of parallel programming problem: recursive divide-and-conquer algorithms. If you think of algorithms like Quicksort or Mergesort, they perfectly fit this model. The core idea is simple: you break a big problem down into smaller sub-problems, solve those sub-problems concurrently, and then combine the results.
The General Recursive Algorithm¶
The framework operates on a simple, recursive pattern. Every task you create follows this logical structure:
Task(problem)
if problem is small enough
solve the problem directly // This is the "base case"
else
subtask1 = fork(new Task(subset of problem)) // Split off a new task to be run in parallel
subtask2 = fork(new Task(subset of problem)) // Split off another task
result1 = join(subtask1) // Wait for the first subtask's result
result2 = join(subtask2) // Wait for the second subtask's result
return combined results // Merge the results from the subtasks
- Fork: The fork() method starts a new task (a new piece of work) asynchronously. It's like telling a helper thread, "Go start working on this sub-problem."
- Join: The join() method retrieves the result of a forked task. This call blocks (waits) until the result from that specific subtask is ready.
- Base Case: This is the condition that stops the recursion. When the problem becomes small enough, it's more efficient to solve it directly in the current thread than to pay the overhead of creating new tasks.
(You should refer to Figure 4.17 in your textbook for a graphical depiction of this forking and joining process.)
A Concrete Example: Summing an Array¶
Let's make this concrete by designing a program that sums all the elements in a large array of integers using the fork-join framework.
Step 1: Set up the Thread Pool and Initial Task
First, you create a special thread pool called a ForkJoinPool. This pool is smart and manages the worker threads that will execute your tasks. You then create the initial, "root" task that represents the entire problem and submit it to the pool.
ForkJoinPool pool = new ForkJoinPool();
int[] array = new int[SIZE]; // This array contains the integers to be summed
SumTask task = new SumTask(0, SIZE - 1, array); // The task to sum the whole array
int sum = pool.invoke(task); // Submit the task and get the final result
The invoke() method submits the task and waits for its completion, returning the final sum.
Step 2: Define the Task (SumTask)
The real work happens inside the SumTask class. This class must define what the task does in its compute() method.
(You should refer to Figure 4.18 in your textbook for the full code listing of the SumTask class.)
Here's a breakdown of what the compute() method does:
- Check the Base Case: It first checks whether the portion of the array it is responsible for is smaller than a certain THRESHOLD (e.g., 1000 elements). If so, it calculates the sum directly using a simple loop. This is the efficient, direct solution.
- Divide (Fork): If the portion is too large, it splits the work in half, creating two new SumTask objects:
  - One for the left half of the current array segment.
  - One for the right half.
  It then uses fork() to push these new tasks into the ForkJoinPool, making them available for worker threads to execute.
- Conquer (Join): After forking, the task calls join() on both subtasks. This is crucial: the task waits for the left subtask to finish and return its result, then waits for the right subtask to do the same.
- Combine: Finally, it combines the results from the left and right subtasks (by simply adding the two partial sums) and returns the total.
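The compute() logic just described can be sketched as a RecursiveTask, along the lines of Figure 4.18 in the text. The THRESHOLD value here is illustrative:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class SumTask extends RecursiveTask<Integer> {
    static final int THRESHOLD = 1000;   // "small enough" cutoff (illustrative)
    private final int begin, end;
    private final int[] array;

    SumTask(int begin, int end, int[] array) {
        this.begin = begin;
        this.end = end;
        this.array = array;
    }

    @Override
    protected Integer compute() {
        if (end - begin < THRESHOLD) {        // base case: sum directly
            int sum = 0;
            for (int i = begin; i <= end; i++)
                sum += array[i];
            return sum;
        }
        int mid = (begin + end) / 2;          // divide the segment in half
        SumTask left = new SumTask(begin, mid, array);
        SumTask right = new SumTask(mid + 1, end, array);
        left.fork();                          // run left half asynchronously
        right.fork();                         // run right half asynchronously
        return left.join() + right.join();    // wait for both, then combine
    }

    public static void main(String[] args) {
        int[] array = new int[10_000];
        for (int i = 0; i < array.length; i++)
            array[i] = 1;
        int sum = new ForkJoinPool().invoke(new SumTask(0, array.length - 1, array));
        System.out.println("sum = " + sum);   // prints 10000
    }
}
```

A common refinement is to fork only one subtask and compute the other in the current thread, saving one task handoff per split.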
Key Classes in the Fork-Join Framework¶
The framework is organized around a few key classes, as shown in the UML diagram in (Figure 4.19):
- ForkJoinTask<V>: The abstract base class for all tasks.
- RecursiveTask<V>: You extend this class when your task returns a result (like our SumTask, which returns an Integer). You must override the compute() method.
- RecursiveAction: You extend this class when your task does not return a result (e.g., a task that sorts an array in place). You also override the compute() method here.
Important Considerations¶
- Choosing the Threshold (THRESHOLD): Deciding when a problem is "small enough" is critical. If the threshold is too high, you don't get enough parallelism. If it's too low, the overhead of creating and managing tasks outweighs the benefits of parallel execution. The textbook notes that finding the optimal value often requires careful experimentation and timing trials.
- Work Stealing: This is the "magic" that makes the ForkJoinPool so efficient. Each thread in the pool has its own double-ended queue (deque) of tasks. When a thread forks a new task, it pushes it onto its own deque; when it is ready for more work, it pops a task from the head of its own deque. The key innovation is that if a thread's deque is empty, it can "steal" a task from the tail of another thread's deque. This work-stealing algorithm efficiently balances the load across all threads, ensuring that no thread is idle while others have work.
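RecursiveAction, mentioned above, follows the same fork-join pattern but returns no result. This sketch (my own example; the threshold is illustrative) doubles every element of an array in place, using invokeAll(), which forks both subtasks and joins them in one call:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

public class DoubleTask extends RecursiveAction {
    static final int THRESHOLD = 500;  // illustrative cutoff
    private final int begin, end;
    private final int[] array;

    DoubleTask(int begin, int end, int[] array) {
        this.begin = begin;
        this.end = end;
        this.array = array;
    }

    @Override
    protected void compute() {
        if (end - begin < THRESHOLD) {
            for (int i = begin; i <= end; i++)
                array[i] *= 2;                      // base case: do the work directly
        } else {
            int mid = (begin + end) / 2;
            // invokeAll() forks both subtasks and joins them for us.
            invokeAll(new DoubleTask(begin, mid, array),
                      new DoubleTask(mid + 1, end, array));
        }
    }

    public static void main(String[] args) {
        int[] array = {1, 2, 3, 4, 5};
        new ForkJoinPool().invoke(new DoubleTask(0, array.length - 1, array));
        System.out.println(java.util.Arrays.toString(array)); // [2, 4, 6, 8, 10]
    }
}
```

Because compute() returns nothing, there are no partial results to combine; the "combine" step is simply the side effect on the shared array.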
4.5.3 OpenMP¶
What is OpenMP?¶
OpenMP is a powerful tool for parallel programming in shared-memory environments, used with C, C++, and FORTRAN. It works through a combination of compiler directives (special commands in your code) and an API library. The main goal of OpenMP is to make it easier to take parts of a program that can run at the same time and actually make them run in parallel, without the programmer having to manually create and manage threads.
The Core Concept: Parallel Regions¶
The fundamental idea in OpenMP is the parallel region. This is a block of code that is marked to be executed by multiple threads simultaneously.
Here is a simple C program that demonstrates this:
#include <omp.h>
#include <stdio.h>
int main(int argc, char *argv[])
{
/* sequential code */ // This runs once, on the main thread.
#pragma omp parallel
{
printf("I am a parallel region.\n");
}
/* sequential code */ // This also runs once, after the parallel region.
return 0;
}
Let's break down what happens:
- The program starts running sequentially (with a single thread).
- When it encounters the #pragma omp parallel directive, the OpenMP run-time library springs into action.
- It creates a team of threads. By default, it creates one thread for each processing core in your system.
- On a dual-core system: 2 threads are created.
- On a quad-core system: 4 threads are created.
- All of these threads then simultaneously execute the code block that follows the directive. In this example, each thread will print "I am a parallel region." So, if you have 4 cores, you will see that message printed 4 times.
- When a thread reaches the end of the block
}, it terminates. The main thread (the original one) continues, and the program resumes its sequential execution.
Parallelizing Loops¶
A very common use for OpenMP is to speed up loops where each iteration is independent of the others. This is called a parallel for loop.
Consider this problem: you have two arrays, a and b, each of size N. You want to add them together and store the result in array c. The sequential way to do this is with a simple for loop.
With OpenMP, you can parallelize this loop with a single directive:
#pragma omp parallel for
for (int i = 0; i < N; i++) {
c[i] = a[i] + b[i];
}
Here's what the #pragma omp parallel for directive does:
- It creates a team of threads (just like omp parallel).
- It then automatically divides the iterations of the for loop among these threads.
- For example, if you have 1000 iterations (N = 1000) and 4 threads, OpenMP might assign iterations 0-249 to thread 1, 250-499 to thread 2, and so on. Each thread works on its assigned chunk of the array independently and in parallel.
- There is an implicit barrier at the end of the loop: all threads must finish their assigned iterations before any thread can proceed to the code that follows.
Key Features and Flexibility¶
OpenMP is not just about basic parallelism; it gives developers fine-grained control:
- Controlling the Number of Threads: You are not stuck with the default. You can manually set the number of threads in your team using functions like omp_set_num_threads() or by setting environment variables.
- Data Scoping: This is a critical concept. OpenMP allows you to specify whether variables are shared between all threads or private to each thread.
  - In the loop example above, the arrays a, b, and c are typically shared (all threads need to read/write them).
  - The loop index i is made private by the parallel for directive. This means each thread gets its own separate copy of i, so threads don't interfere with each other's counting.
- Wide Availability: OpenMP is supported by many popular open-source and commercial compilers (like GCC, Clang, and MSVC) on Linux, Windows, and macOS.
In summary, OpenMP provides a high-level, relatively simple way to add parallelism to your code by using special directives that tell the compiler how to split the work across multiple threads in a shared-memory system.
4.5.4 Grand Central Dispatch¶
What is Grand Central Dispatch (GCD)?¶
Grand Central Dispatch (GCD) is a technology developed by Apple for its macOS and iOS operating systems. It's a complete solution for managing parallel tasks, consisting of:
- A run-time library
- An API (Application Programming Interface)
- Language extensions
Its main goal is to simplify parallel programming. Like OpenMP, GCD handles the complex details of thread management, allowing you, the developer, to focus on what tasks can run in parallel, not how the threads are created and scheduled.
The Core Mechanism: Dispatch Queues¶
The central concept in GCD is the dispatch queue. You don't manage threads directly; instead, you define units of work (tasks) and place them into a queue. GCD's system then takes these tasks from the queue and schedules them to run on available threads from a pool it manages.
GCD has two fundamental types of queues:
1. Serial Queues (Private Dispatch Queues)
- Tasks are removed from the queue in a First-In, First-Out (FIFO) order.
- However, only one task is executed at a time. A task must completely finish before the next task in the queue is removed and started.
- Characteristics:
- Each application has a special serial queue called the main queue, which is used for all updates to the user interface.
- Developers can create their own serial queues for process-specific tasks.
- They are perfect for ensuring that a series of tasks happens in a strict, predictable sequence, which is crucial for preventing race conditions when accessing shared resources.
2. Concurrent Queues (Global Dispatch Queues)
- Tasks are also removed in FIFO order.
- However, multiple tasks can be removed at once, allowing them to execute in parallel.
- The system provides several system-wide concurrent queues that you can use. These are categorized by Quality of Service (QoS) classes, which help the system prioritize your tasks intelligently.
Quality of Service (QoS) Classes¶
GCD uses QoS classes to determine the priority and resource allocation for tasks. This allows the system to ensure a responsive user experience while efficiently managing battery life. The four primary classes are, in order of priority:
QOS_CLASS_USER_INTERACTIVE:
- Purpose: For tasks that directly interact with the user to keep the interface responsive and smooth (e.g., animations, event handling).
- Workload: Tasks should be very short and require minimal processing time.
QOS_CLASS_USER_INITIATED:
- Purpose: For tasks that the user has started and is waiting for, so they can continue interacting with the app (e.g., opening a document, processing a user's button press).
- Workload: Can take longer than user-interactive tasks, but should still be completed in a few seconds or less.
QOS_CLASS_UTILITY:
- Purpose: For long-running tasks where the user does not expect immediate results (e.g., importing data, downloading a large file, complex calculations).
- Workload: These are typically longer-running operations that provide a progress indicator.
QOS_CLASS_BACKGROUND:
- Purpose: For tasks that are not time-sensitive and the user is not directly aware of (e.g., indexing files, performing backups, pre-fetching content).
- Workload: The system will schedule these tasks when resources are available, prioritizing the user's battery life.
Defining Tasks: Blocks and Closures¶
To submit work to a queue, you need to define the task. GCD provides two ways to do this, depending on the language:
- Blocks (C, C++, Objective-C): A block is a language extension that defines a self-contained unit of work. Its syntax is a caret ^ followed by braces { } containing the code.
  - Example: ^{ printf("I am a block"); }
- Closures (Swift): In Swift, the same concept is achieved using a closure. Syntactically, it is very similar to a block but without the leading caret.
  - Example: { print("I am a closure.") }
Putting It All Together: A Code Example¶
The following Swift code shows how to obtain a concurrent queue and submit a task to it asynchronously.
// 1. Get a reference to a system-wide concurrent queue.
// We specify the .userInitiated Quality of Service class.
let queue = DispatchQueue.global(qos: .userInitiated)
// 2. Submit a task (closure) to the queue to be executed asynchronously.
// The async() method returns immediately, without waiting for the task to finish.
queue.async {
print("I am a closure running on a concurrent queue.")
}
GCD Internals and Portability¶
- Under the Hood: Internally, GCD's thread pool is implemented using POSIX threads (Pthreads). GCD actively manages this pool, dynamically growing or shrinking the number of threads based on the current demand and the overall capacity of the system.
- Portability: While developed by Apple, the core library behind GCD (libdispatch) has been released as open source and has been ported to other operating systems, such as FreeBSD.
4.5.5 Intel Threading Building Blocks (TBB)¶
What is Intel TBB?¶
Intel Threading Building Blocks (TBB) is a sophisticated C++ template library designed to support the development of parallel applications. Its key advantage is that it is purely a library; it doesn't require a special compiler or language extensions. You use it by including its headers and linking to its library.
The core idea is that you, the developer, specify logical tasks that can run in parallel. The TBB task scheduler then takes over, managing the complex job of mapping these tasks onto the underlying physical threads (like Pthreads or Windows threads). This scheduler is very smart—it provides automatic load balancing and is cache-aware, meaning it tries to schedule tasks in a way that maximizes the use of data already in the CPU cache for faster execution.
Key Features of TBB¶
TBB provides a rich toolkit for parallel programming, including:
- Templates for Parallel Algorithms: Such as parallel_for, parallel_reduce, and pipeline, which let you easily parallelize common patterns.
- Synchronization: Support for atomic operations and mutual-exclusion locks (mutexes).
- Concurrent Data Structures: Thread-safe versions of common data structures like a hash map, queue, and vector. These are designed for high performance in concurrent environments and can be used as safer alternatives to standard template library (STL) containers in multi-threaded code.
A Core Example: The Parallel For Loop¶
Let's explore the most common use case: parallelizing a simple loop.
Imagine you have a function apply(float value) that performs an operation on a single value. You also have an array v of size n containing float values. The sequential way to process this array is with a standard for loop:
// Sequential for loop
for (int i = 0; i < n; i++) {
apply(v[i]);
}
To parallelize this on a multicore system, you could manually split the array into chunks and assign each chunk to a different thread. However, this approach is tedious and hardcodes the parallelism to a specific number of cores, making the code inflexible and non-portable.
TBB solves this with its parallel_for template. The general form is:
parallel_for(range, body)
- range: Defines the set of items to be processed (the "iteration space").
- body: Defines the operation to be performed on a sub-range or individual items.
Here is how you would rewrite the sequential loop using TBB:
// Parallel for loop with TBB
tbb::parallel_for(size_t(0), n, [=](size_t i) { apply(v[i]); });
Let's break down the three parameters in this call:
- size_t(0): The start of the iteration space (index 0).
- n: The end of the iteration space. Together these say: iterate from 0 to n-1, just like the original loop.
- [=](size_t i) { apply(v[i]); }: A C++ lambda function that acts as the loop body.
  - [=]: The capture clause. The equals sign means the lambda captures the variables it uses from the outer scope by value (here, copies of the pointer v and the size n), which is safe for parallel execution.
  - (size_t i): The parameter. The TBB library calls this lambda many times in parallel, each time passing a different index i from the iteration space (0 to n-1).
  - { apply(v[i]); }: The body of the function, which executes for each index i.
How TBB Manages the Work¶
When you call parallel_for, TBB's internal machinery takes over:
- Dividing Work: The library automatically divides the full range of iterations (0 to n-1) into smaller "chunks." You can specify the chunk size, but often the default is efficient.
- Creating Tasks: It creates a set of tasks, where each task is responsible for executing the loop body (the lambda function) over one of these chunks.
- Scheduling on Threads: The TBB task scheduler assigns these tasks to a pool of worker threads it manages. This is very similar to the Java Fork-Join framework. If one thread finishes its chunk early, the scheduler can give it another chunk from another thread (work-stealing), ensuring all threads stay busy and the load is balanced.
Summary and Advantage¶
The primary advantage of TBB is abstraction. You only need to identify what can run in parallel (by using constructs like parallel_for), and the library handles all the low-level details of thread creation, management, load balancing, and synchronization. This leads to code that is more portable, maintainable, and often performs better than a manual threading approach.
Intel TBB is available in both commercial and open-source versions and runs on all major operating systems: Windows, Linux, and macOS.
4.6 Threading Issues¶
This section covers important complications that arise when using threads in real-world programs. The standard behavior of familiar system calls can change in a multithreaded environment, and programmers must be aware of these nuances.
4.6.1 The fork() and exec() System Calls¶
As you learned in the Computer Architecture course and in Chapter 3, the fork() and exec() system calls are fundamental to process creation in UNIX/Linux systems. However, their semantics become more complex when the process performing the call contains multiple threads.
The Problem with fork() in a Multithreaded Program¶
The central question is: When a thread in a multithreaded process calls fork(), what does the new child process look like?
- Does it duplicate all threads? The new child process could be an exact replica of the parent, with copies of all its threads.
- Or is it single-threaded? The new process could contain only the thread that called fork().
Different UNIX systems have addressed this problem by providing two versions of the fork() call:
- One version that duplicates all threads of the parent process in the child.
- Another version that duplicates only the thread that invoked fork().
The Behavior of exec()¶
The exec() system call behaves more predictably. If any thread in a process calls exec(), it works as described in Chapter 3: the program specified in the exec() call completely replaces the current process, including its code, data, heap, stack, and all of its threads. The new process starts from the main() function of the new program with a single thread.
Choosing the Right fork() Semantics¶
The choice between duplicating all threads or just one depends entirely on what the child process is intended to do immediately after the fork().
Case 1: fork() is followed immediately by exec()
- Scenario: This is a common pattern where the goal is to create a new process to run a completely different program (e.g., a shell creating a new process to run ls).
- Appropriate fork() semantics: Duplicating only the calling thread is appropriate and efficient. The child process is going to have its entire address space wiped away and replaced by the new program when exec() is called, so duplicating all the other threads from the parent would be a waste of effort: they would be terminated instantly. It can also be dangerous if other threads are holding locks, as the locks would be copied into the child in a locked state and never unlocked (since the threads holding them don't exist in the child), leading to deadlocks.
Case 2: fork() is not followed by exec()
- Scenario: The goal is to create a new child process that continues executing the same program, potentially to handle a task in parallel.
- Appropriate fork() semantics: Here, the child process needs to be a fully functional copy of the parent. Therefore, it should duplicate all threads. This allows the child to continue the parent's work using its entire parallel structure.
In summary, the correct use of fork() in a multithreaded program requires you to know the child's purpose and to use the version of fork() whose semantics match that purpose to ensure correctness and efficiency.
4.6.2 Signal Handling¶
What is a Signal?¶
A signal is a software interrupt delivered to a process in UNIX/Linux systems, notifying it that a specific event has occurred. Signals can be categorized by how they are delivered:
Synchronous Signals: These are generated by the process's own execution.
- Cause: The event is a direct result of an action performed by the process's own code.
- Examples: An illegal memory access (Segmentation Fault), division by zero, or executing an illegal instruction.
- Delivery: They are delivered to the very thread that caused the event.
Asynchronous Signals: These are generated by an event external to the process.
- Cause: An event outside the running process.
- Examples: A user pressing
Ctrl+Cto terminate a program (this sends theSIGINTsignal), or a timer expiring. - Delivery: They are sent to the process from another process or the operating system.
Handling a Signal¶
When a signal is delivered, it must be processed. There are two types of handlers:
- Default Signal Handler: Every signal has a default action defined by the kernel. For example, the default for SIGSEGV (segmentation fault) is to terminate the process. For SIGCHLD (child process stopped or terminated), the default is to ignore it.
- User-Defined Signal Handler: A program can override the default action and specify its own function to be called when a specific signal arrives. This allows the program to perform custom cleanup or ignore the signal entirely.
The Challenge: Where to Deliver a Signal in a Multithreaded Program?¶
In a single-threaded program, this is simple: the one and only thread handles all signals. In a multithreaded program, it becomes a critical design question: which thread should receive the signal?
Several strategies exist:
- Deliver the signal to the thread to which the signal applies. (Most logical for synchronous signals).
- Deliver the signal to every thread in the process.
- Deliver the signal to certain threads in the process.
- Assign a specific thread to receive all signals for the process.
The correct strategy depends on the signal type:
For Synchronous Signals: The signal must be delivered to the thread that caused it. It wouldn't make sense for a thread that performed a legal operation to be terminated because another thread in the same process caused a segmentation fault.
For Asynchronous Signals: The strategy is less clear-cut and depends on the signal's intent.
- A signal like SIGTERM (a request to terminate) is typically meant for the entire process, so it should be delivered to all threads.
- Other asynchronous signals might be targeted. To manage this, POSIX threads (Pthreads) allows each thread to have its own signal mask, which it uses to specify which signals it will accept and which it will block.
APIs for Signal Delivery¶
Standard Process-Level Signal (kill): The classic UNIX function kill(pid_t pid, int signal) sends a signal to a process (identified by pid). In a multithreaded environment, the system then decides which thread within that process will handle it. Typically, it will be delivered to the first thread found that is not blocking that signal.

Thread-Targeted Signal (pthread_kill): Pthreads provides a more precise function: pthread_kill(pthread_t tid, int signal). This allows a programmer to send a specific signal to a specific thread (identified by its thread ID, tid), giving fine-grained control over signal handling.
Signal Handling in Windows¶
Windows does not have a direct equivalent to UNIX signals. Instead, it provides Asynchronous Procedure Calls (APCs).
- How it Works: An APC allows a thread to specify a function that should be called when that thread receives a particular notification. The APC is queued to the target thread, and the thread executes the function when it enters an "alertable" wait state.
- Key Difference: The APC model is more straightforward for threading because an APC is, by design, delivered to a specific, single thread. This avoids the ambiguity present in the UNIX multithreaded signal delivery model.
4.6.3 Thread Cancellation¶
Thread cancellation is the process of terminating a thread before it has naturally finished its work. This is a common feature in multi-threaded programs where results can become obsolete, or a user changes their mind.
Real-World Examples:
- Database Search: Multiple threads search a database in parallel. As soon as one thread finds the result, the others are canceled to save CPU resources.
- Web Browser: A web page loads with multiple threads (one for the main HTML, one for each image). When you press the "Stop" button, the browser cancels all the threads that are still loading the page.
A thread that is scheduled for termination is called the target thread.
The Two Scenarios for Cancellation¶
There are two fundamental ways to cancel a thread, each with major implications for safety and resource management.
1. Asynchronous Cancellation
- How it works: One thread immediately and forcibly terminates the target thread.
- The Major Problem: This is dangerous. The target thread can be killed at any point in its execution, even if it is in the middle of a critical operation. Imagine canceling a thread that:
- Has allocated memory (leading to a memory leak).
- Is holding a lock (causing other threads to wait forever, a deadlock).
- Is in the middle of updating a shared data structure (leaving it in a corrupted, inconsistent state).
- Resource Reclamation: The operating system will reclaim some system resources (like the thread's kernel data structures), but it often cannot reclaim all application-level resources (like memory, open files, or locks). Therefore, asynchronous cancellation is generally considered unsafe and is not recommended.
2. Deferred Cancellation
- How it works: The cancellation request is sent to the target thread, but the thread is not terminated immediately. Instead, the target thread periodically checks if a cancellation request is pending. When it sees the request, it terminates itself at a safe point in its code.
- Why it's Better: This allows the thread to be canceled when it is not in the middle of a critical operation. The thread can release any resources it holds (like freeing memory or unlocking mutexes) before it terminates, preventing leaks, deadlocks, and data corruption.
Thread Cancellation in Pthreads¶
In the Pthreads (POSIX threads) library, cancellation is initiated using the pthread_cancel(tid) function. However, calling this function only requests cancellation; it does not guarantee the thread will be terminated. The actual behavior depends on the target thread's configuration.
Pthreads Cancellation Modes:
A thread's cancellation behavior is controlled by a state and a type.
| Mode | State | Type |
|---|---|---|
| Off | Disabled | -- |
| Deferred | Enabled | Deferred |
| Asynchronous | Enabled | Asynchronous |
- State (Disabled/Enabled): If cancellation is Disabled, the thread will ignore all cancellation requests (though they remain pending). The thread can later enable cancellation to act on them.
- Type (Deferred/Asynchronous): This determines how an enabled thread responds to a cancellation request. The default type is Deferred.
How Deferred Cancellation Works in Code:
A thread using deferred cancellation will only be terminated when it reaches a cancellation point. A cancellation point is a specific function call (usually a blocking system call) where it is safe to check for termination.
- Common cancellation points include read(), write(), sleep(), and pthread_join(). (You can see a full list on a Linux system with the command man pthreads.)
- A thread can also create its own cancellation point by calling pthread_testcancel(). This function does nothing if there is no pending cancellation request. If there is a request, the function does not return, and the thread is terminated immediately at that point.
Code Example of Deferred Cancellation:
while (1) {
/* do some work for awhile */
...
/* check if there is a cancellation request */
pthread_testcancel();
}
Cleanup Handlers: Pthreads allows a thread to register cleanup handler functions. These are functions that are automatically called if the thread is canceled. This is the primary mechanism for ensuring resources are properly released before the thread dies (e.g., to free memory or unlock a mutex).
Important Note: Due to the severe risks of resource leaks and data corruption, the Pthreads documentation strongly discourages the use of asynchronous cancellation. It is safer and more predictable to use deferred cancellation and manage resource cleanup properly.
Implementation Note on Linux: On Linux systems, the Pthreads library implements thread cancellation internally using signals (which were discussed in Section 4.6.2).
Thread Cancellation in Java¶
Java provides a mechanism for thread cancellation that follows a policy very similar to deferred cancellation in Pthreads. It is designed to be safer and more cooperative than forced termination.
The Java Interruption Mechanism¶
Instead of a direct cancel command, Java uses an interruption model.
Initiating Cancellation: To request that a thread terminate, you call the interrupt() method on the Thread object:

Thread worker;
. . .
// Set the interruption status of the thread
worker.interrupt();

This method does not forcibly stop the thread. Instead, it does two things:
- It sets the thread's interruption status to true.
- If the thread is currently blocked in a method like sleep(), wait(), or join(), that method will immediately throw an InterruptedException.
Checking for Cancellation: The target thread is responsible for periodically checking if it has been interrupted and then cleaning up and terminating itself. This is the "cooperative" or "deferred" part.
A thread can check its interruption status in a loop using the isInterrupted() method:

while (!Thread.currentThread().isInterrupted()) {
    // Do some work
    . . .
}
// Once interrupted, break out of the loop and clean up

When the interrupt() method is called, the isInterrupted() check becomes true, the loop condition fails, and the thread exits the loop. The thread can then release any resources and terminate gracefully.
Handling Blocking Calls with InterruptedException¶
A crucial part of Java's design is what happens when a thread is interrupted while blocked. If the thread is sleeping (Thread.sleep()) or waiting (object.wait()), the blocking method will throw an InterruptedException.
When this exception is thrown, the interruption status is automatically cleared (set back to false). The standard practice is to either terminate the thread or re-set the interruption status and exit.
try {
Thread.sleep(1000);
} catch (InterruptedException e) {
// The thread was interrupted while sleeping.
// Restore the interruption status and break out of the loop
Thread.currentThread().interrupt();
}
Why is this similar to Pthreads?
- Cooperative: Both models rely on the target thread to notice the cancellation request and terminate itself safely.
- Safe Points: In Pthreads, cancellation happens at "cancellation points" (like read()). In Java, interruption is noticed either during an explicit check (isInterrupted()) or when the thread is in a blocking state (which are the natural safe points in Java).
Summary: Java's interrupt() mechanism is a form of deferred cancellation. It provides a structured and safe way for one thread to request another thread to stop, while giving the target thread full control to finish its current work, release resources, and terminate in an orderly fashion.
4.6.4 Thread-Local Storage (TLS)¶
Threads within the same process share the process's global data and heap memory. This data sharing is a key advantage of multithreading. However, there are situations where each thread needs its own private copy of a specific data item. This private data is known as Thread-Local Storage (TLS).
What is TLS and Why is it Needed?¶
TLS is used when you need data that is global in scope (visible across many functions) but unique to each thread.
Example: Transaction Processing System
- Imagine a server that processes financial transactions, and each transaction is handled by a separate thread.
- Each transaction has a unique identifier. If all threads used a single global variable for this ID, they would overwrite each other's values, causing chaos.
- The solution is to store the transaction ID in TLS. This way, each thread has its own private copy of the transaction_id variable. Any function within the thread can access this variable and get the ID for the transaction that specific thread is handling, without interference from other threads.
TLS vs. Local Variables¶
It's important not to confuse TLS with local variables.
- Local Variables: Exist only on the stack for the duration of a single function call. They are not visible to other functions.
- Thread-Local Storage (TLS): Behaves like a global variable in terms of its visibility (it can be accessed from any function in the thread), but it has a separate, unique instance for each thread. Its lifetime is the entire execution of the thread.
TLS in Different Programming Environments¶
Most modern threading libraries and compilers provide built-in support for TLS.
1. Pthreads
- Pthreads uses a key-based mechanism. You create a key (pthread_key_t) that acts as a universal handle. Each thread can then use this same key to store and retrieve a pointer to its own unique piece of data.
  - pthread_key_create() - Creates a new key.
  - pthread_setspecific() - Stores a pointer in the TLS for the current thread.
  - pthread_getspecific() - Retrieves the pointer from the TLS for the current thread.
2. Java
- Java provides the ThreadLocal&lt;T&gt; class. This is a type-safe and easier-to-use wrapper around the key concept.
  - set(T value) - Stores a value in the TLS for the current thread.
  - T get() - Retrieves the value from the TLS for the current thread.
3. C#
- C# uses a simple attribute. By adding [ThreadStatic] before a static variable declaration, you tell the compiler that each thread should get its own copy of that variable.
4. C/C++ (using the gcc compiler)
You can declare a variable as thread-local by using the __thread storage-class keyword (or thread_local in C11/C++11):

static __thread int threadID;

This line declares threadID as a static integer, but with a separate, unique instance for every thread. Each thread can read and write its own threadID without affecting the threadID in any other thread.
In summary: Thread-Local Storage is a crucial feature for writing multi-threaded programs where you need data that is globally accessible within a thread but must remain private to that thread. It combines the scope of a global variable with the isolation of a local variable, making it perfect for storing per-thread context like transaction IDs, user sessions, or other thread-specific state.
4.6.5 Scheduler Activations¶
This section addresses a complex but crucial issue in advanced multithreading models: how the user-level thread library and the operating system kernel coordinate to manage threads efficiently. This coordination is essential for the many-to-many and two-level models (discussed in Section 4.3.3) to work effectively.
The Need for an Intermediate Layer: Lightweight Processes (LWPs)¶
In many-to-many and two-level models, there isn't a direct one-to-one mapping between user threads and kernel threads. To manage this, systems introduce an intermediate data structure called a Lightweight Process (LWP).
Think of an LWP as a "Virtual Processor": From the perspective of the user-thread library, an LWP looks like a virtual CPU core on which it can schedule a user thread to run. Go to Figure 4.20 to see this relationship visually.
The Connection to the Kernel: Each LWP is attached to a single kernel thread. It is these kernel threads that the operating system scheduler actually places onto the physical processors (the real CPU cores).
The Chain of Blocking: If a kernel thread blocks (for example, because it made a system call to read from a disk), the LWP attached to it also blocks. Consequently, the user-level thread that is currently running on that LWP blocks as well.
How Many LWPs Does an Application Need?¶
The number of LWPs assigned to a process determines its level of concurrency.
CPU-bound Application: If an application's threads are mostly doing computation (CPU-bound), and it's running on a single-core machine, it only needs one LWP. Only one thread can run at a time anyway.
I/O-bound Application: If an application has threads that frequently make blocking I/O requests (like reading files or waiting for network data), it needs multiple LWPs. Each thread that executes a blocking system call requires its own LWP to wait on. If you have five threads making simultaneous blocking calls but only four LWPs, the fifth thread will be stuck and cannot proceed until an LWP becomes free.
The Solution: Scheduler Activations and Upcalls¶
Scheduler activation is a specific scheme designed to facilitate intelligent communication between the kernel and the user-thread library. The goal is to allow the kernel to dynamically adjust the number of LWPs for optimal performance.
Here's how it works:
The Kernel Provides Resources: The kernel gives the application a set of virtual processors (LWPs). The application's thread library is responsible for scheduling its many user threads onto these available LWPs.
The Kernel Sends Notifications (Upcalls): The kernel proactively informs the application's thread library about important events. This notification from the kernel to the user-level library is called an upcall. An upcall is essentially a "callback" from the kernel. The function in the thread library that handles this upcall is called an upcall handler, and it must run on an LWP.
The Upcall Process in Action¶
Let's trace a common scenario:
Scenario 1: A Thread is About to Block
- A user thread, running on an LWP, invokes a blocking system call (like read()).
- The kernel, before putting the kernel thread (and thus the LWP) to sleep, makes an upcall to the application. This upcall says: "Hey, thread X is about to block."
- The kernel, recognizing that the process is about to lose an execution resource, allocates a new LWP to the application.
- The application runs an upcall handler on this new LWP. This handler:
- Saves the state of the blocking user thread.
- Marks the original LWP (the one that is now blocked) as being in a "blocked" state.
- Looks at its ready queue and schedules a different, runnable user thread onto the new LWP.
- Result: The application has maintained its level of concurrency. Although one LWP is blocked, a new one was provided, and a different user thread can continue executing.
Scenario 2: A Blocked Thread Becomes Ready
- The I/O operation that the blocked thread was waiting for completes.
- The kernel makes another upcall to the thread library, informing it that the thread is now unblocked.
- This upcall handler also needs an LWP to run on. The kernel may allocate a new one or temporarily preempt a running user thread.
- The upcall handler marks the previously blocked thread as runnable again.
- The thread library can now schedule this newly unblocked thread onto an available LWP.
In summary, scheduler activations create a partnership. The kernel manages the physical resources (CPUs) and informs the user-level library about important events via upcalls. The user-level library, which has the best knowledge of its own threads' needs and states, then makes intelligent scheduling decisions. This cooperation allows the system to dynamically adjust to the workload, ensuring that runnable user threads always have LWPs to run on, even when others are blocked.
4.7 Operating-System Examples¶
This section provides concrete examples of how the threading concepts we've discussed are implemented in real-world operating systems, starting with Microsoft Windows.
4.7.1 Windows Threads¶
In the Windows operating system, every application runs as a separate process, and each process can contain one or more threads. Windows implements the one-to-one threading model (as described in Section 4.3.2), meaning every user-level thread is directly associated with a unique kernel-level thread.
Components of a Windows Thread¶
Each thread in Windows is an executable entity that contains the following key components:
- A Thread ID: A unique number that identifies the thread.
- A Register Set: Represents the current state of the CPU's registers for this thread.
- A Program Counter (PC): Points to the next instruction in the thread to be executed.
- Two Stacks:
- A User Stack: Used when the thread is executing in user mode (running its own code).
- A Kernel Stack: Used when the thread executes in kernel mode (e.g., during a system call).
- A Private Storage Area: Used by various run-time libraries and Dynamic Link Libraries (DLLs).
The register set, stacks, and private storage area are collectively known as the context of the thread. The OS must save and restore this context during every thread switch.
Primary Data Structures of a Windows Thread¶
The Windows kernel manages threads using three main data structures. The relationship between these structures is illustrated in Figure 4.21.
1. ETHREAD (Executive Thread Block)
- Location: Exists entirely in kernel space (only the operating system can access it).
- Key Contents:
- A pointer to the process to which the thread belongs.
- The starting address of the routine where the thread begins execution.
- A pointer to the corresponding KTHREAD structure.
2. KTHREAD (Kernel Thread Block)
- Location: Exists entirely in kernel space.
- Key Contents:
- Scheduling and synchronization information for the thread.
- The kernel stack (used when the thread is running in kernel mode).
- A pointer to the TEB (Thread Environment Block).
3. TEB (Thread Environment Block)
- Location: Exists in user space. This allows the thread to access its own information efficiently while it is running in user mode without needing to switch to kernel mode.
- Key Contents:
- The thread identifier.
- The user-mode stack.
- An array for thread-local storage (TLS), which we discussed in Section 4.6.4.
Summary of the Data Flow (from Figure 4.21): The ETHREAD points to the KTHREAD, which contains the kernel-level details. The KTHREAD, in turn, points to the TEB, which holds the user-accessible information. This separation ensures that user applications can access their own thread-specific data efficiently via the TEB, while the kernel maintains full control and security through the ETHREAD and KTHREAD structures.
4.7.2 Linux Threads¶
Linux takes a unique and philosophically different approach to threads compared to Windows. The core idea is that Linux does not distinguish between processes and threads at the kernel level. Instead, Linux uses the more general term task to refer to any flow of control within a program.
The Unified Model: fork(), clone(), and Sharing¶
Linux provides two primary mechanisms for creating new flows of control:
- fork(): The traditional system call, described in Chapter 3, which creates a new process. The child process gets a copy of the parent's resources (memory, file descriptors, etc.).
- clone(): The system call used to create what other systems call "threads." The key to clone() is that it allows the programmer to explicitly define what resources are shared between the parent and the new child task.
How clone() Works: The Power of Flags¶
The behavior of clone() is controlled by a set of flags passed as parameters. These flags determine which data structures the parent and child will share. Go to Figure 4.22 to see some of the key flags.
Creating a Thread (High Sharing): If you invoke clone() with flags like:

- CLONE_VM (share the same memory space)
- CLONE_FS (share file-system information like the current working directory)
- CLONE_FILES (share the same set of open files)
- CLONE_SIGHAND (share signal handlers)

...then the new task is a thread. It shares most of its resources with the parent, which is the defining characteristic of threads within a process.

Creating a Process (No Sharing): If you invoke clone() with none of these sharing flags set, the new task does not share resources with its parent. This results in behavior functionally equivalent to the fork() system call, creating a separate process.
Kernel Implementation: struct task_struct¶
This flexible sharing is possible because of how Linux represents a task in the kernel. The kernel maintains a data structure for each task called struct task_struct.
Crucially, this structure does not store the task's data directly. Instead, it contains pointers to other data structures that hold the actual information, such as:
- A structure for the list of open files.
- A structure for signal-handling information.
- A structure for virtual memory (mm_struct).
How fork() and clone() work under the hood:
- When fork() is called, the kernel creates a new task_struct and then creates new copies of all the data structures that the task_struct points to.
- When clone() is called, the kernel creates a new task_struct, but depending on the flags, it may simply copy the pointers from the parent's task_struct instead of creating new copies of the data structures themselves. For example, if CLONE_FILES is set, the new task's task_struct will point to the same "open files" data structure as its parent.
Extension to Containers¶
The flexibility of clone() goes beyond just threads and processes. The same mechanism is the foundation for containers, a lightweight virtualization technique introduced in Chapter 1.
- Just as certain flags allow you to create a task that shares most resources (a thread) or shares very few (a process), there are other flags (like CLONE_NEWPID and CLONE_NEWNET) that can be passed to clone() to create an isolated environment known as a container.
- These flags create a new, isolated namespace for resources like process IDs, network interfaces, and users, making the task behave as if it's running on its own separate Linux system, even though it's sharing the same kernel.
Containers will be covered in more detail in Chapter 18.
In summary, Linux uses a unified task model. The distinction between a "process" and a "thread" is merely a matter of which resources are shared, controlled by the flags passed to the clone() system call. This design provides immense flexibility, allowing the creation of everything from fully independent processes to tightly-coupled threads and isolated containers, all using the same underlying kernel mechanism.
4.8 Summary¶
This chapter has provided a comprehensive overview of threads and concurrency. Here are the key takeaways:
What is a Thread? A thread is a basic unit of CPU utilization. Threads within the same process share the process's resources, such as code, data, and files.
Benefits of Multithreading: There are four main advantages:
- Responsiveness: Allows a program to remain interactive even if part of it is blocked or performing a long operation.
- Resource Sharing: Threads share memory and resources of their process by default, making data sharing easy.
- Economy: Creating and managing threads is much faster and less resource-intensive than creating new processes.
- Scalability: On multicore systems, multithreading allows an application to run faster by executing in parallel on multiple cores.
Concurrency vs. Parallelism:
- Concurrency: Multiple threads are making progress. This is possible on a single CPU core through time-slicing.
- Parallelism: Multiple threads are executing simultaneously. This requires a system with multiple processing cores.
Challenges of Multithreading: Writing multithreaded programs is difficult. Key challenges include dividing work and data, managing data dependencies, and the complex task of testing and debugging.
Data vs. Task Parallelism:
- Data Parallelism: The same operation is performed on different subsets of the same data across multiple cores.
- Task Parallelism: Different tasks (operations) are distributed across multiple cores.
Threading Models (User-to-Kernel Mapping):
- Many-to-One: Many user threads map to a single kernel thread (limited concurrency).
- One-to-One: Each user thread maps to its own kernel thread (provides more concurrency).
- Many-to-Many: Many user threads are multiplexed to a smaller or equal number of kernel threads (offers flexibility and good concurrency).
Thread Libraries: APIs for creating and managing threads. The primary examples are:
- Pthreads: A standard for POSIX systems (UNIX, Linux, macOS).
- Windows Threads: The native threading API for the Windows operating system.
- Java Threads: Managed within the Java Virtual Machine (JVM), making them portable across different operating systems.
Implicit Threading: A modern approach where the programmer defines tasks, and the compiler, API, or runtime framework (like thread pools, fork-join, or Grand Central Dispatch) automatically handles the creation and management of the threads. This reduces the complexity for the programmer.
Thread Cancellation:
- Asynchronous Cancellation: Immediately terminates the target thread. This is dangerous as it can lead to resource leaks and data corruption.
- Deferred Cancellation: The target thread checks for cancellation requests and terminates itself at a safe point. This is the preferred and safer method.
Linux's Unique Approach: Linux does not have a separate concept for processes and threads at the kernel level. It uses the general term task. The clone() system call, with its specific flags, determines the level of resource sharing, creating anything from a fully independent process to a tightly-coupled thread.
Chapter 5: CPU Scheduling - Introduction¶
1. What is CPU Scheduling and Why Do We Need It?¶
At its core, CPU scheduling is the fundamental technique that makes multiprogrammed operating systems possible and productive.
- The Core Idea: In a single-processor system, only one process can run on the CPU at any given moment. The job of the CPU scheduler is to decide which one of the many ready-to-run processes gets to use the CPU next.
- The Goal: By rapidly switching the CPU between different processes (a concept known as context switching), the operating system creates the illusion that multiple programs are running simultaneously. This keeps the CPU busy and maximizes overall system productivity, especially when some processes are waiting for I/O operations (like reading a file or waiting for a network packet) to complete.
2. Key Terminology Clarification¶
The text clarifies two important points about terminology that often cause confusion:
- Processes vs. Threads: In modern operating systems, the kernel doesn't actually schedule entire processes. It schedules kernel-level threads. However, the terms "process scheduling" and "thread scheduling" are frequently used to mean the same thing. For this chapter:
- We will use "Process Scheduling" when talking about general, high-level concepts.
- We will use "Thread Scheduling" when a concept applies specifically to threads.
- CPU vs. Core: As you learned in computer architecture, a single CPU can contain multiple computational units called cores. When this chapter says a process is scheduled "on the CPU," it is a simplification. What it really means is that a thread from that process is scheduled to run on one of the CPU's cores.
3. Chapter Objectives: A Roadmap¶
This chapter will guide you through the following key areas:
- Algorithms: You will learn about various CPU scheduling algorithms (like Round-Robin, Priority Scheduling, etc.).
- Evaluation: You will learn the criteria (e.g., waiting time, throughput) to assess and compare these algorithms.
- Advanced Systems: We will explore the complexities of scheduling for systems with multiple processors or multicore CPUs.
- Real-Time Systems: You will be introduced to specialized scheduling algorithms needed for systems with strict timing deadlines (real-time systems).
- Real-World Examples: We will look at how major operating systems like Windows, Linux, and Solaris actually implement scheduling.
- Analysis & Implementation: Finally, you will learn how to use modeling to evaluate scheduling algorithms and even how to design a program that implements them.
5.1 Basic Concepts¶
1. The Core Problem: The CPU is a Single Resource¶
In a system with only one CPU core, a fundamental limitation exists: only one process (or thread) can run at any given moment. All other processes must wait in a queue for their turn to use the CPU. The goal of an operating system is to manage this waiting and switching efficiently.
2. The Goal of Multiprogramming: Maximize CPU Utilization¶
The primary objective of multiprogramming is to keep the CPU as busy as possible by always having some process running. Here's the simple logic behind it:
- The Problem with No Multiprogramming: Imagine a process that needs to read a large file from the hard disk. It sends the request to the disk (an I/O operation), which is very slow compared to the CPU. In a simple system, the CPU would now sit idle, doing nothing, until the disk drive finishes its work. This is a massive waste of the CPU's computational power.
- The Multiprogramming Solution: Instead of letting the CPU sit idle, the operating system keeps multiple processes in memory at the same time. When one process (like our file reader) has to wait for an I/O operation, the OS steps in:
- It takes the CPU away from the waiting process. (This is called a context switch).
- It gives the CPU to another process that is ready to run and has work to do.
This cycle continues constantly. Every time one process is forced to wait, the OS can quickly find another process to take over the CPU, ensuring that productive work is nearly always being done.
3. Extending to Multicore Systems¶
The same principle of keeping the processor busy applies to modern multicore systems (CPUs with multiple cores). The goal is simply scaled up: instead of keeping one CPU core busy, the operating system's job is to keep all of the cores busy by scheduling processes across them.
4. CPU Scheduling is a Fundamental OS Function¶
The text emphasizes that scheduling is not an optional feature; it is a core, fundamental function of any modern operating system. The CPU is one of the computer's most critical resources, and efficiently managing access to it through scheduling is central to the very design of the OS.
5.1.1 CPU–I/O Burst Cycle¶
The Fundamental Pattern of Process Execution¶
The entire concept of CPU scheduling is built upon a key observation about how processes behave: a process's execution is not a single, continuous computation. Instead, it is a cycle that alternates between two main states:
- CPU Burst: A period where the process is actively executing instructions on the CPU.
- I/O Burst: A period where the process is waiting for an input/output operation to complete (e.g., reading from a disk, waiting for user input, writing to a network).
The Cycle of Execution: A process's life is a series of these bursts. It starts with a CPU burst, then an I/O burst, then another CPU burst, and so on. This cycle continues until the process finishes, with its final CPU burst ending with a system request to terminate.
(You should refer to Figure 5.1 in your text, which visually represents this alternating cycle.)
The Importance of Burst Duration¶
Extensive research has been done to measure the length of these CPU bursts. While the exact durations vary, they follow a predictable pattern, as shown in a frequency histogram.
(You should refer to Figure 5.2 in your text, which is a histogram showing the frequency of CPU-burst durations.)
The curve in Figure 5.2 is crucial. It is exponential or hyperexponential, meaning:
- There are a very large number of short CPU bursts.
- There are a relatively small number of long CPU bursts.
This distribution is not just a curiosity; it has a direct impact on scheduling. We can categorize programs based on this pattern:
- I/O-bound programs: These programs spend most of their time waiting for I/O. When they get the CPU, they use it only briefly to set up the next I/O request. Therefore, they generate many short CPU bursts. Examples: a text editor, a web browser waiting for a click.
- CPU-bound programs: These programs spend most of their time performing intensive computations. They use the CPU for long, uninterrupted periods. Therefore, they generate a few long CPU bursts. Examples: a scientific calculation, compiling a large software project.
Understanding this mix of short and long bursts is vital for designing effective CPU-scheduling algorithms, as a good scheduler must handle both types of programs well.
5.1.2 CPU Scheduler¶
The Decision-Maker¶
The CPU scheduler (often just called the short-term scheduler) is the part of the operating system that makes the critical decision of what to run next. Its job is triggered by a specific event: whenever the CPU becomes idle.
For example, the CPU becomes idle when:
- A running process terminates.
- A running process voluntarily gives up the CPU to wait for an I/O operation.
- A running process is preempted (e.g., by a timer interrupt), so a new process must be chosen.
At that moment, the scheduler must select one process from the ready queue and allocate the CPU to it.
The Ready Queue¶
The ready queue is the central list of all processes that are resident in memory, ready to execute, and waiting for the CPU.
Important points about the ready queue:
- It's an Abstract Data Structure: The ready queue is not necessarily a simple first-in, first-out (FIFO) line. Depending on the scheduling algorithm, it can be implemented as:
- A FIFO queue
- A Priority queue
- A Tree
- An unordered linked list
- What's in the Queue? The entries in the ready queue are typically the Process Control Blocks (PCBs) of the processes. The PCB is the kernel data structure that holds all the information the OS needs to manage a process (process state, program counter, CPU registers, etc.).
Conceptually, you can imagine all the ready processes lined up, and the CPU scheduler is the manager that picks the next one to run.
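The idea above can be sketched in a few lines. This is a deliberately simplified model, not how any real kernel stores PCBs: the `PCB` class and its fields are hypothetical stand-ins, and a plain FIFO `deque` is just one of the possible ready-queue structures listed above.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical, stripped-down PCB. A real kernel tracks far more state
# (CPU registers, open files, memory maps, scheduling statistics, ...).
@dataclass
class PCB:
    pid: int
    state: str = "ready"
    program_counter: int = 0

# A FIFO ready queue is only one possible implementation; a priority
# queue, tree, or unordered list would also satisfy the abstraction.
ready_queue = deque()

for pid in (1, 2, 3):
    ready_queue.append(PCB(pid))      # new PCB joins the tail

next_pcb = ready_queue.popleft()      # the scheduler picks the head
next_pcb.state = "running"
print(next_pcb.pid)                   # process 1 is dispatched first
```

With a FIFO queue the pick is trivial; the later scheduling algorithms differ mainly in how this "pick the next PCB" step is made.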
5.1.3 Preemptive and Nonpreemptive Scheduling¶
When Does Scheduling Happen?¶
CPU-scheduling decisions are not random; they occur at specific, well-defined moments when a process changes its state. There are four such circumstances:
- Running → Waiting: A process voluntarily gives up the CPU because it needs to wait for something (e.g., an I/O operation, or for another process to finish).
- Running → Ready: A process is forcibly interrupted and moved back to the ready queue (e.g., when a timer interrupt occurs, or a higher-priority process becomes ready).
- Waiting → Ready: A process finishes waiting (e.g., its I/O operation completes) and is moved back to the ready queue.
- Process Terminates: A process finishes its execution and exits.
For circumstances 1 and 4, the OS has no choice—the process is done using the CPU, so the scheduler must pick a new one from the ready queue.
For circumstances 2 and 3, the OS does have a choice. It can either let the currently running process continue, or it can decide to switch to a different one. This choice defines the two main types of scheduling.
Nonpreemptive Scheduling¶
- Definition: Under nonpreemptive (or cooperative) scheduling, the scheduler only takes control during circumstances 1 and 4.
- The Rule: Once a process is given the CPU, it keeps it until it voluntarily releases it by terminating or by switching to the waiting state.
- Analogy: It's like having a meeting where a person holds the microphone and keeps it until they are finished speaking and hand it over.
- Usage: This method is simple but is largely obsolete for general-purpose OS cores. Virtually all modern operating systems (Windows, macOS, Linux, UNIX) use preemptive scheduling.
Preemptive Scheduling¶
- Definition: Under preemptive scheduling, the scheduler can take control during circumstances 2 and 3 as well.
- The Rule: The operating system can forcibly stop a currently running process, even if it is not finished, and give the CPU to another process.
- Analogy: It's like a debate moderator who can take the microphone away from one speaker to give it to another.
- Usage: This is the standard for all modern operating systems because it provides better system responsiveness and ensures that no single process can monopolize the CPU.
The Challenges of Preemption¶
Preemptive scheduling is powerful but introduces two major complications:
1. Race Conditions on Shared Data:
- The Problem: Imagine two processes sharing a variable. Process A is in the middle of updating the variable when it gets preempted. Process B then runs and reads the variable, but it reads an intermediate, inconsistent value because Process A didn't finish its update.
- The Solution: This problem requires synchronization mechanisms (like mutex locks and semaphores) to protect shared data. This is a central topic of Chapter 6.
2. Design of the Operating-System Kernel:
- Nonpreemptive Kernel: In this design, a process executing a system call in the kernel cannot be preempted until it finishes the call or blocks for I/O. This is simple and safe because kernel data structures cannot be left in an inconsistent state by a preemption. However, it is bad for real-time computing as it can cause long, unpredictable delays.
- Preemptive Kernel: In this design, a process can be preempted even while it is executing in the kernel. This is better for responsiveness and real-time systems but requires the kernel itself to be carefully written using synchronization mechanisms (like mutex locks) to protect all its internal data structures from concurrent access. Most modern OS kernels are preemptive.
Handling Interrupts: The text also notes that the kernel must handle interrupts carefully. To prevent corruption, small, critical sections of kernel code will temporarily disable interrupts at the start and re-enable them at the end. This ensures that a sequence of instructions that must run together is not interrupted. These sections are kept very short to avoid missing important hardware signals.
(You should refer to Figure 5.3 in your text, which illustrates the role of the dispatcher, the module that handles the actual context switch after the scheduler has made its decision.)
5.1.4 Dispatcher¶
The "Switch" in Context Switching¶
The CPU Scheduler is the brain that decides which process runs next. The Dispatcher is the muscle that carries out that decision. It is the kernel module responsible for actually transferring control of the CPU core to the newly selected process.
The Dispatcher's Job Description¶
When the dispatcher is activated, it performs a precise sequence of actions:
Performing a Context Switch:
- It saves the state of the currently running process (like its program counter, CPU registers) into its Process Control Block (PCB).
- It loads the saved state of the new process from its PCB into the CPU registers and its program counter into the CPU's program counter register.
Switching to User Mode:
- The CPU switches from privileged kernel mode (where the OS scheduler and dispatcher run) back to unprivileged user mode, where the user process will execute.
Jumping to the User Program:
- Finally, the dispatcher jumps to the location in the user program that was saved in the program counter, resuming the program's execution exactly where it left off.
Dispatch Latency: The Cost of Switching¶
The time it takes for the dispatcher to complete this entire process—stopping one process and starting another—is called the dispatch latency.
(You should refer to Figure 5.3 in your text, which visually breaks down this latency.)
Why is this important? The dispatcher is invoked during every process/thread switch. This time is pure overhead; the CPU is doing administrative work instead of running user programs. Therefore, the dispatcher must be engineered to be extremely fast to minimize this overhead and maximize useful CPU time.
How Often Do Context Switches Occur?¶
Context switching happens very frequently. The text shows how to measure this on a Linux system.
System-Wide View with vmstat:
- The command vmstat 1 provides system statistics every second.
- The cs column shows the number of context switches.
- In the example, the system averaged 24 context switches per second since boot, with 225 and 339 context switches in two recent one-second intervals. This shows that context switching is a constant and heavy activity.
Per-Process View with /proc:
- You can examine the file /proc/<PID>/status for any running process.
- This file shows two key statistics:
  - voluntary_ctxt_switches: the process gave up the CPU itself, typically because it needed to wait for an I/O operation (e.g., 150 times in the example).
  - nonvoluntary_ctxt_switches: the OS took the CPU away from the process, such as when its time quantum expired or a higher-priority process became ready (e.g., 8 times in the example).
This distinction is crucial. A high number of voluntary switches often indicates an I/O-bound process, while a high number of nonvoluntary switches might indicate a CPU-bound process that the OS must preempt to be fair to others.
5.2 Scheduling Criteria¶
How Do We Judge a Scheduling Algorithm?¶
Different CPU-scheduling algorithms have different strengths and weaknesses. To compare them objectively and decide which one is best for a given situation, we use a set of standard performance metrics. The choice of which metric to prioritize can completely change which algorithm is considered "best."
Here are the key criteria used for evaluation:
The Five Key Metrics¶
CPU Utilization
- What it is: The percentage of time the CPU is busy doing useful work. We want to keep this as high as possible.
- Range: Conceptually 0% to 100%. In a real system, it typically ranges from 40% (lightly loaded) to 90% (heavily used).
- How to measure: You can use commands like top on Linux/macOS.
Throughput
- What it is: The amount of work completed per unit of time. Specifically, it's the number of processes that finish their execution per second (or minute, etc.).
- Context Matters: For long-running processes, throughput might be low (e.g., 1 process per second). For short transactions, it should be very high (e.g., 1000 processes per second).
Turnaround Time
- What it is: The total time taken from when a process is submitted to the system until it finally completes. This is the process's perspective on its total lifetime in the system.
- Formula: Turnaround Time = (Time of Completion) - (Time of Submission). It is the sum of all time spent waiting in the ready queue, executing on the CPU, and performing I/O.
Waiting Time
- What it is: This is the specific amount of time a process spends waiting in the ready queue. The scheduling algorithm directly affects this metric.
- Crucial Point: The scheduler has no control over how long a process needs to run on the CPU or perform I/O. Its only influence is on how long the process sits in line waiting for its turn. Therefore, minimizing waiting time is a primary goal of a good scheduler.
Response Time
- What it is: This is specific to interactive systems. It is the time from when a request is submitted (e.g., the user presses 'Enter') until the system produces its first response (e.g., the first character appears on the screen). It measures how quickly the system starts to respond, not how long it takes to finish the task.
- Why it matters: For a user, a system that starts responding immediately feels much faster, even if the total job takes the same amount of time.
Optimization Goals and Nuances¶
- The General Rule: We want to maximize CPU Utilization and Throughput, and minimize Turnaround Time, Waiting Time, and Response Time.
- Average vs. Extreme Values: We usually optimize for the average case (e.g., average waiting time). However, sometimes we care about the worst case. For example, to ensure fairness, we might want to minimize the maximum response time any user experiences.
- The Importance of Predictability (Variance): For interactive systems, a predictable and consistent response time is often more desirable than a faster but highly variable one. A system that usually responds in 0.1 seconds but sometimes takes 5 seconds feels sluggish and frustrating. Unfortunately, minimizing variance is a complex problem and not a primary focus of most common scheduling algorithms.
A Note on Our Examples¶
In the following sections, we will illustrate different scheduling algorithms with simplified examples. To keep things clear:
- We will represent each process by a single CPU burst time in milliseconds.
- Our primary metric for comparison will be the average waiting time.
- Remember that real processes have many alternating CPU and I/O bursts, and more complex evaluation methods (discussed in Section 5.8) are needed for a full analysis.
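The two per-process metrics above follow directly from their definitions. Here is a minimal sketch under the chapter's convention (one CPU burst per process, no I/O); the helper names are our own, not from the text.

```python
# Sketch: computing turnaround and waiting time from a finished schedule,
# assuming the chapter's single-CPU-burst convention.

def turnaround_time(arrival, completion):
    # Total time in the system, from submission to completion.
    return completion - arrival

def waiting_time(arrival, burst, completion):
    # Time spent in the ready queue = time in system minus time on the CPU.
    return turnaround_time(arrival, completion) - burst

# Hypothetical process: submitted at t = 0, needs 5 ms of CPU,
# and completes at t = 12 after waiting behind other processes.
print(turnaround_time(0, 12))    # 12 ms
print(waiting_time(0, 5, 12))    # 7 ms
```

The examples in the next section use exactly this waiting-time arithmetic.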
5.3 Scheduling Algorithms¶
Introduction to Scheduling Algorithms¶
This section introduces the core algorithms that the CPU scheduler uses to solve its central problem: deciding which process in the ready queue gets to use the CPU next.
We will explore several different algorithms, each with its own strategy and characteristics. It's important to understand that the choice of algorithm has a direct and significant impact on the performance metrics we just learned about (waiting time, response time, etc.).
A Note on the Scope of Our Discussion¶
To keep the initial explanations clear and focused on the fundamental concepts, we will describe all these algorithms under a specific, simplified assumption:
- We are working with a system that has only a single CPU with a single core.
- This means the system is physically capable of running only one process at a single point in time.
This single-core model allows us to understand the pure logic of each algorithm without the added complexity of managing multiple cores simultaneously. Later, in Section 5.5, we will expand this discussion to cover the more complex scheduling issues that arise in multiprocessor and multicore systems.
5.3.1 First-Come, First-Served (FCFS) Scheduling¶
The Simplest Algorithm¶
The First-Come, First-Served (FCFS) scheduling algorithm is the most straightforward approach. The rule is simple: the process that requests the CPU first is the one that gets it first.
- Implementation: It is easily managed with a First-In, First-Out (FIFO) queue.
- When a process enters the ready queue, its PCB is added to the tail of the queue.
- When the CPU becomes free, it is allocated to the process at the head of the queue.
- That process is then removed from the queue.
- Advantage: The algorithm is very simple to code and understand.
The Problem: Poor Average Waiting Time¶
The main disadvantage of FCFS is that it can lead to very long average waiting times, especially when the order of process arrival is unlucky.
Let's analyze an example with three processes arriving at time 0:
| Process | Burst Time (ms) |
|---|---|
| P1 | 24 |
| P2 | 3 |
| P3 | 3 |
Scenario 1: Order P1, P2, P3
The Gantt chart (a bar chart showing the schedule) would look like this:
|---- P1 -----|-- P2 --|- P3 -|
0 24 27 30
- Waiting Times:
- P1: 0 ms (runs immediately)
- P2: 24 ms (waits for P1 to finish)
- P3: 27 ms (waits for P1 and P2 to finish)
- Average Waiting Time: (0 + 24 + 27) / 3 = 17 ms
Scenario 2: Order P2, P3, P1
Now, let's see what happens if the shorter processes arrive first:
|-- P2 --|- P3 -|---- P1 -----|
0 3 6 30
- Waiting Times:
- P2: 0 ms
- P3: 3 ms
- P1: 6 ms
- Average Waiting Time: (0 + 3 + 6) / 3 = 3 ms
This demonstrates a massive improvement just by changing the order. The key takeaway is that FCFS can result in a very high average waiting time, which is not minimal and is highly dependent on the arrival order of processes.
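Both scenarios can be checked with a minimal FCFS simulator. This is a sketch, not production code: it assumes all processes arrive at time 0 (as in the example) and that the input list order is the arrival order; the function name is ours.

```python
# Minimal FCFS simulator for the example above: processes are
# (name, burst) pairs, all arriving at time 0, served in list order.

def fcfs_waiting_times(procs):
    waits, clock = {}, 0
    for name, burst in procs:
        waits[name] = clock      # time spent waiting before first run
        clock += burst           # runs to completion (nonpreemptive)
    return waits

order1 = [("P1", 24), ("P2", 3), ("P3", 3)]
order2 = [("P2", 3), ("P3", 3), ("P1", 24)]

w1 = fcfs_waiting_times(order1)
w2 = fcfs_waiting_times(order2)
print(sum(w1.values()) / 3)   # 17.0 ms
print(sum(w2.values()) / 3)   # 3.0 ms
```

Running it reproduces the two averages above and makes the arrival-order sensitivity of FCFS easy to experiment with.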
The Convoy Effect¶
FCFS can cause a serious performance problem known as the convoy effect.
- The Situation: Imagine one CPU-bound process (with a long CPU burst) and many I/O-bound processes (with very short CPU bursts).
- What Happens:
- The CPU-bound process gets the CPU and holds it for a long time.
- Meanwhile, all the I/O-bound processes finish their I/O quickly and enter the ready queue, where they are stuck waiting for the long process.
- During this wait, the now-idle I/O devices are unused.
- Finally, the CPU-bound process finishes and moves to an I/O device. Now the short processes quickly run through their CPU bursts and go back to I/O.
- This leaves the CPU idle until the long CPU-bound process finishes its I/O and gets the CPU again.
- The Result: The short processes are "convoyed" behind the slow, long process, resulting in lower CPU and I/O device utilization than if the shorter processes were allowed to run first.
Nonpreemptive Nature¶
The FCFS algorithm is nonpreemptive. Once a process gets the CPU, it keeps it until it voluntarily terminates or blocks for I/O.
- Consequence for Interactive Systems: This makes FCFS terrible for interactive systems. Allowing one process to hold the CPU for an extended period would make the entire system feel unresponsive, as no other process could run until the current one decided to yield.
5.3.2 Shortest-Job-First (SJF) Scheduling¶
The Core Idea¶
The Shortest-Job-First (SJF) scheduling algorithm selects the process with the smallest next CPU burst. When the CPU becomes available, it is assigned to the process that has the shortest upcoming CPU burst.
- Tie-breaking: If two processes have the same next CPU burst length, First-Come, First-Served (FCFS) scheduling is used to break the tie.
- Important Terminology Note: A more accurate name for this algorithm would be "Shortest-Next-CPU-Burst" scheduling, as the decision is based solely on the length of the next burst from the process, not the total job length. However, the term SJF is universally used.
Nonpreemptive SJF Example and Optimality¶
Consider the following set of processes arriving at time 0:
| Process | Burst Time (ms) |
|---|---|
| P1 | 6 |
| P2 | 8 |
| P3 | 7 |
| P4 | 3 |
Using nonpreemptive SJF, the scheduler will pick the process with the shortest burst first. The resulting Gantt chart is:
|- P4 -|- P1 -|- P3 -|-- P2 --|
0 3 9 16 24
- Waiting Time Calculation:
- P1: 3 ms (waits for P4)
- P2: 16 ms (waits for P4, P1, and P3)
- P3: 9 ms (waits for P4 and P1)
- P4: 0 ms (runs immediately)
- Average Waiting Time: (3 + 16 + 9 + 0) / 4 = 7.0 milliseconds
For comparison, the text states that FCFS would yield an average waiting time of 10.25 ms, confirming SJF's superiority here.
Optimality: The SJF scheduling algorithm is provably optimal for minimizing the average waiting time for a given set of processes. The reason is that moving a short process before a long one decreases the short process's waiting time more than it increases the long process's waiting time, thereby reducing the overall average.
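Nonpreemptive SJF with all processes arriving at time 0 is just FCFS after sorting by burst length. A minimal sketch (the function name is ours; the stable sort plus dict insertion order gives the FCFS tie-break mentioned above):

```python
# Nonpreemptive SJF for processes that all arrive at time 0.
# procs: dict of name -> next-CPU-burst length (ms); dict insertion
# order is taken as arrival order, so Python's stable sort breaks
# burst-length ties FCFS-style.

def sjf_waiting_times(procs):
    waits, clock = {}, 0
    for name, burst in sorted(procs.items(), key=lambda kv: kv[1]):
        waits[name] = clock      # waits until all shorter bursts finish
        clock += burst
    return waits

waits = sjf_waiting_times({"P1": 6, "P2": 8, "P3": 7, "P4": 3})
print(waits)                    # {'P4': 0, 'P1': 3, 'P3': 9, 'P2': 16}
print(sum(waits.values()) / 4)  # 7.0 ms
```

This reproduces the 7.0 ms average computed above.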
The Major Problem: Implementation¶
Although SJF is optimal, it has a critical flaw: there is no way to know the exact length of the next CPU burst for a process. This makes it impossible to implement a perfect SJF scheduler at the CPU scheduling level.
The Solution: Predicting the Next CPU Burst
Since we cannot know the future, we try to predict it. The assumption is that a process's next CPU burst will be similar in length to its previous ones. We use the Exponential Average (or Aging) method to calculate this prediction.
The Exponential Average Formula:
Let:
- ( t_n ) = actual length of the nth CPU burst
- ( τ_{n+1} ) = predicted value for the next CPU burst
- ( α ) = smoothing parameter, where ( 0 ≤ α ≤ 1 )
The formula is:
( τ_{n+1} = α t_n + (1 − α) τ_n )
How the Formula Works:
- ( t_n ) represents the most recent, actual measurement.
- ( τ_n ) stores the entire past history of predictions and measurements.
- The parameter ( α ) controls the weight given to recent history versus past history.
- If ( α = 0 ), then ( τ_{n+1} = τ_n ). Recent history has no effect; the prediction is unchanged.
- If ( α = 1 ), then ( τ_{n+1} = t_n ). Only the most recent CPU burst matters; all past history is ignored.
- Commonly, ( α = 1/2 ), giving equal weight to recent and past history.
(You should refer to Figure 5.4 in your text, which shows a specific example of this prediction calculation with α=1/2 and an initial guess τ₀=10.)
The expanded form of the formula shows that each successive term in the history has less weight than its predecessor, which is why it's called an "exponential" average.
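The prediction rule is easy to turn into code. A sketch with α = 1/2 and initial guess τ₀ = 10 as in Figure 5.4 (the function name and the sample burst lengths 6 and 4 are illustrative choices of ours):

```python
# Exponential-average prediction of the next CPU burst.

def predict_next(bursts, alpha=0.5, tau0=10.0):
    tau = tau0
    for t in bursts:
        # tau_{n+1} = alpha * t_n + (1 - alpha) * tau_n
        tau = alpha * t + (1 - alpha) * tau
    return tau

# After observing bursts of 6 and 4 ms:
#   tau_1 = 0.5*6 + 0.5*10 = 8.0
#   tau_2 = 0.5*4 + 0.5*8  = 6.0
print(predict_next([6, 4]))   # 6.0
```

With alpha=0.0 the prediction never moves off tau0, and with alpha=1.0 it simply echoes the last burst, matching the two special cases discussed above.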
Preemptive SJF (Shortest-Remaining-Time-First)¶
SJF can be implemented in a preemptive version. This is called Shortest-Remaining-Time-First (SRTF) scheduling.
- The Rule: When a new process arrives in the ready queue, its CPU burst time is compared to the remaining CPU burst time of the currently executing process. If the new process has a shorter burst, the current process is preempted.
Preemptive SJF Example:
Consider these processes with arrival times:
| Process | Arrival Time | Burst Time (ms) |
|---|---|---|
| P1 | 0 | 8 |
| P2 | 1 | 4 |
| P3 | 2 | 9 |
| P4 | 3 | 5 |
The resulting preemptive SJF (SRTF) schedule is depicted in the following Gantt chart:
|- P1 -|- P2 -|- P4 -|-- P1 --|---- P3 ----|
0 1 5 10 17 26
Step-by-Step Explanation:
- Time 0: Only P1 is present, so it starts.
- Time 1: P2 arrives. P1 has 7ms remaining. P2 has a burst of 4ms, which is shorter. P1 is preempted, and P2 starts.
- Time 2: P3 arrives. P2 has 3ms remaining. P3 has a burst of 9ms, which is longer. P2 continues.
- Time 3: P4 arrives. P2 has 2ms remaining. P4 has a burst of 5ms, which is longer. P2 continues until it finishes at time 5.
- Time 5: Now we check the ready queue: P1 (7ms remaining), P3 (9ms), P4 (5ms). The shortest remaining time is P4 (5ms), so P4 runs.
- Time 10: P4 finishes. The shortest remaining time is now P1 (7ms), so P1 resumes.
- Time 17: P1 finishes. The only process left is P3 (9ms), so it runs to completion.
Waiting Time Calculation:
- P1: Starts at 0, is preempted at time 1, resumes at 10, and finishes at 17. It waits in the ready queue from time 1 until time 10, giving a waiting time of 9 ms. Equivalently, using the formula (Finish Time - Arrival Time - Burst Time) = (17 - 0 - 8) = 9 ms.
- P2: (Finish Time - Arrival Time - Burst Time) = (5 - 1 - 4) = 0ms.
- P3: (26 - 2 - 9) = 15ms.
- P4: (10 - 3 - 5) = 2ms.
- Average Waiting Time: (9 + 0 + 15 + 2) / 4 = 26/4 = 6.5 milliseconds
- The key point is that preemptive SJF provides a lower average waiting time than nonpreemptive SJF would for this same set of processes (7.75 ms).
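The step-by-step trace above can be verified with a small millisecond-by-millisecond SRTF simulator. This is a sketch under simplifying assumptions (no context-switch cost, 1 ms ticks, ties broken by earliest arrival); the function name is ours.

```python
# Minimal preemptive SJF (SRTF) simulator, stepping 1 ms at a time.
# procs: name -> (arrival, burst). Ties go to the earliest arrival.

def srtf_waiting_times(procs):
    remaining = {n: b for n, (a, b) in procs.items()}
    finish, clock = {}, 0
    while remaining:
        ready = [n for n in remaining if procs[n][0] <= clock]
        if not ready:
            clock += 1           # CPU idles until the next arrival
            continue
        # Re-decided every tick: smallest remaining time wins.
        n = min(ready, key=lambda n: (remaining[n], procs[n][0]))
        remaining[n] -= 1
        clock += 1
        if remaining[n] == 0:
            finish[n] = clock
            del remaining[n]
    # waiting time = finish - arrival - burst
    return {n: finish[n] - a - b for n, (a, b) in procs.items()}

waits = srtf_waiting_times({"P1": (0, 8), "P2": (1, 4),
                            "P3": (2, 9), "P4": (3, 5)})
print(waits)                    # {'P1': 9, 'P2': 0, 'P3': 15, 'P4': 2}
print(sum(waits.values()) / 4)  # 6.5 ms
```

Running it reproduces the per-process waiting times and the 6.5 ms average from the worked example.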
5.3.3 Round-Robin (RR) Scheduling¶
The Core Idea: FCFS with Preemption¶
The Round-Robin (RR) scheduling algorithm is designed especially for time-sharing systems. It is similar to FCFS but adds preemption to ensure fairness.
- The Mechanism: A small unit of time, called a time quantum or time slice, is defined. This is typically between 10 and 100 milliseconds.
- The Ready Queue: The ready queue is treated as a circular FIFO queue.
- The Rule: The CPU scheduler goes around the ready queue, allocating the CPU to each process for a maximum of one time quantum.
How RR Scheduling is Implemented¶
The implementation works as follows:
- The CPU scheduler picks the first process from the ready queue.
- It sets a timer to interrupt after one time quantum.
- It dispatches the process.
One of two things then happens:
- Scenario A: The process finishes before the quantum.
- If the process's CPU burst is shorter than the time quantum, it will voluntarily release the CPU by terminating or blocking for I/O before the timer goes off.
- The scheduler then simply proceeds to the next process in the ready queue.
- Scenario B: The process does not finish.
- If the process's CPU burst is longer than the time quantum, the timer will interrupt the process when the quantum expires.
- The OS performs a context switch and puts the interrupted process at the tail of the ready queue.
- The CPU scheduler then selects the next process from the head of the queue.
RR Scheduling Example¶
Consider the following processes arriving at time 0:
| Process | Burst Time (ms) |
|---|---|
| P1 | 24 |
| P2 | 3 |
| P3 | 3 |
Using a time quantum of 4 ms, the resulting RR schedule is:
|- P1 -|- P2 -|- P3 -|- P1 -|- P1 -|- P1 -|- P1 -|- P1 -|
0 4 7 10 14 18 22 26 30
Waiting Time Calculation:
- P1: Runs from 0 to 4, then waits from time 4 to time 10 while P2 and P3 run. After time 10, P1 is the only remaining process, so it runs without further interruption until it finishes at time 30.
- Total Waiting Time = (Finish Time - Arrival Time - Total CPU Time) = (30 - 0 - 24) = 6 ms.
- P2: Arrives at 0, starts running at time 4, finishes at 7. Waiting Time = (4 - 0) = 4 ms.
- P3: Arrives at 0, starts running at time 7, finishes at 10. Waiting Time = (7 - 0) = 7 ms.
Average Waiting Time: (6 (P1) + 4 (P2) + 7 (P3)) / 3 = 17/3 ≈ 5.66 milliseconds.
Key Properties of RR¶
- Preemptive: RR is inherently preemptive because processes are interrupted at the end of their time quantum.
- Fairness Guarantee: If there are n processes in the ready queue and the time quantum is q, then each process gets 1/n of the CPU time in chunks of at most q time units. No process has to wait more than (n - 1) * q time units to get its next turn.
The Critical Factor: Time Quantum Size¶
The performance of RR depends almost entirely on the size of the time quantum.
- Very Large Quantum: If the time quantum is extremely large (e.g., larger than the longest CPU burst), RR degenerates into FCFS. A process will finish its entire CPU burst in one quantum.
- Very Small Quantum: If the time quantum is extremely small (e.g., 1 ms), the system suffers from a very high rate of context switches.
(You should refer to Figure 5.5 in your text, which shows how a process with a burst of 10 requires more context switches as the time quantum decreases.)
- Overhead: If the context-switch time is about 10% of the time quantum, then a full 10% of CPU time is wasted on overhead. This is unacceptable.
- Practice: In practice, context-switch time is usually less than 10 microseconds, and time quanta are 10-100 ms, so the context-switch overhead is a very small fraction (about 0.1%) of the quantum.
Effect on Turnaround Time¶
(You should refer to Figure 5.6 in your text, which shows the relationship between time quantum size and average turnaround time.)
The average turnaround time does not always improve as the time quantum increases.
- Example: With three processes, each with a 10ms burst:
- If the quantum is 1ms, the average turnaround time is high (29ms) due to many context switches.
- If the quantum is 10ms, the average turnaround time drops to 20ms, as each process finishes in one shot.
The Rule of Thumb: The time quantum should be:
- Large compared to the context-switch time to keep overhead low.
- Not so large that it behaves like FCFS.
- A good guideline is that 80% of CPU bursts should be shorter than the time quantum. This allows most processes to finish their burst in a single quantum, minimizing preemption and context switches while maintaining good response times.
5.3.5 Multilevel Queue Scheduling¶
The Concept: Separating Processes into Classes¶
The previous algorithms assumed all processes were in a single queue. Multilevel Queue Scheduling recognizes that different types of processes have different needs. It partitions the ready queue into several separate queues, each with its own scheduling algorithm.
Multilevel Queue for Priorities¶
This approach is a natural extension of priority scheduling. Instead of having one queue where the scheduler must search for the highest-priority process (an O(n) operation), we create a separate queue for each priority level.
(You should refer to Figure 5.7 in your text, which illustrates separate queues for each priority level, with threads T0-Tz in their respective queues.)
- How it works: The scheduler always runs a process from the highest-priority, non-empty queue.
- Scheduling within a queue: If there are multiple processes in the highest-priority queue, they can be scheduled using another algorithm, most commonly Round-Robin (RR).
- Fixed Priority: In this simple scheme, a process is assigned a priority statically (at creation) and remains in the same queue for its entire lifetime.
Multilevel Queue for Process Types¶
A more common use of multilevel queue scheduling is to partition processes based on their type or purpose, as each type has different response-time requirements.
(You should refer to Figure 5.8 in your text, which shows a common multilevel queue system with queues for real-time, system, interactive, and batch processes.)
A typical division includes these queues, listed from highest to lowest priority:
- Real-time Processes: Require immediate attention to meet deadlines.
- System Processes: OS-level processes that must be responsive.
- Interactive Processes: User-facing applications (editors, shells) that need good response time.
- Batch Processes: Background jobs (compilers, scientific calculations) that have no strict timing needs.
Each of these queues can use the scheduling algorithm best suited to its process type. For example:
- The foreground (interactive) queue might use RR scheduling to ensure good response time.
- The background (batch) queue might use FCFS scheduling because throughput is more important than response time.
Scheduling Between the Queues¶
Now that we have multiple queues, we need a meta-scheduler to decide how to allocate the CPU between them. There are two primary methods:
1. Fixed-Priority Preemptive Scheduling (Most Common)
- The Rule: Each queue has an absolute, fixed priority. The scheduler always runs processes from the highest-priority non-empty queue.
- Preemption: If a process enters a higher-priority queue while a lower-priority process is running, the lower-priority process is immediately preempted.
- Example: In the 4-queue system above, no batch process (queue 4) could run unless the real-time, system, and interactive queues (1, 2, and 3) were all empty. If an interactive process becomes ready while a batch process is running, the batch process is preempted.
2. Time Slicing Between Queues
- The Rule: Each queue gets a predefined percentage of the total CPU time.
- Example: In a foreground-background system:
- The foreground (interactive) queue gets 80% of the CPU time and schedules its processes using RR.
- The background (batch) queue gets 20% of the CPU time and schedules its processes using FCFS.
- This method prevents starvation of lower-priority queues but may not provide the strict prioritization that some system processes require.
5.3.6 Multilevel Feedback Queue Scheduling¶
The Concept: Dynamic and Adaptive Queues¶
The standard Multilevel Queue scheduler is inflexible because processes are permanently assigned to a queue. The Multilevel Feedback Queue algorithm is a more sophisticated and adaptive approach that allows processes to move between queues based on their observed behavior. The core idea is to automatically separate processes according to the actual characteristics of their CPU bursts.
How It Works: The Core Principles¶
The algorithm is designed to achieve two main goals:
- Favor Short Bursts (I/O-bound and Interactive Processes): If a process uses too much CPU time in one go, it is moved to a lower-priority queue. This leaves processes with short CPU bursts (like I/O-bound and interactive ones) in the higher-priority queues where they get quick service.
- Prevent Starvation: A process that waits too long in a lower-priority queue may be moved to a higher-priority queue. This technique, called aging, ensures that no process is permanently ignored.
A Detailed Example¶
Consider a multilevel feedback queue scheduler with three queues (Q0, Q1, Q2), as shown in the figure.
(You should refer to Figure 5.9 in your text, which illustrates this 3-queue system with their respective time quanta.)
Here are the rules for this specific example:
- Queue 0: Highest priority. Uses Round-Robin with a time quantum of 8 ms.
- Queue 1: Medium priority. Uses Round-Robin with a time quantum of 16 ms.
- Queue 2: Lowest priority. Uses FCFS scheduling.
The Scheduling and Migration Rules:
- Initial Placement: A new process is always placed in the highest-priority queue, Queue 0.
- Execution Order: The scheduler always runs all processes in Q0. Only when Q0 is empty will it run processes in Q1. Only when both Q0 and Q1 are empty will it run processes in Q2.
- Preemption: A process entering Q0 will preempt a process running from Q1 or Q2. Similarly, a process entering Q1 will preempt a process running from Q2.
- Demotion (Moving Down):
- If a process in Q0 does not finish its CPU burst within its 8 ms time quantum, it is preempted and moved to the tail of Q1.
- If a process in Q1 does not finish its CPU burst within its 16 ms time quantum, it is preempted and moved to the tail of Q2.
- Promotion (Moving Up - Aging): To prevent starvation, any process that waits too long in Q1 or Q2 may be gradually moved up to a higher queue.
What This Achieves:
- Processes with CPU bursts of 8 ms or less will finish quickly in Q0 and get the best response time.
- Processes needing between 8 and 24 ms will be served quickly in Q0 and Q1, but with slightly lower priority.
- Long, CPU-bound processes will automatically sink to Q2, where they are run in the background with whatever CPU time is left over from the higher-priority queues.
The General Parameters of a Multilevel Feedback Queue¶
This algorithm is the most general and configurable of all the scheduling methods. To define a specific multilevel feedback queue scheduler, you must specify these parameters:
- The number of queues.
- The scheduling algorithm for each queue (e.g., RR, FCFS).
- The method used to determine when to upgrade a process to a higher-priority queue (e.g., after waiting for a certain duration to implement aging).
- The method used to determine when to demote a process to a lower-priority queue (e.g., if it uses up its time quantum).
- The method used to determine which queue a process will enter when that process needs service (e.g., all new processes start in the highest queue).
Advantages and Disadvantages¶
- Advantage: It is the most flexible and adaptive algorithm. It can be tuned to match a specific system's workload perfectly, automatically favoring short jobs while preventing starvation.
- Disadvantage: It is the most complex algorithm. There is no single "best" configuration; finding the right set of parameters (number of queues, time quanta, promotion/demotion rules) for a given system is a difficult design problem.
5.4 Thread Scheduling¶
Recap: User and Kernel Threads¶
As introduced in Chapter 4, operating systems distinguish between user-level and kernel-level threads.
- Kernel-Level Threads: These are the entities that the operating system kernel schedules onto CPU cores. The kernel is aware of and manages them directly.
- User-Level Threads: These are managed entirely by a thread library within a user process. The kernel is unaware of their existence.
For a user-level thread to run on a CPU, it must be mapped to a kernel-level thread, often through an intermediary called a Lightweight Process (LWP). This mapping leads to two different scopes of scheduling competition.
5.4.1 Contention Scope¶
The concept of Contention Scope defines the pool of threads that compete with each other for CPU resources. There are two distinct scopes:
1. Process-Contention Scope (PCS)¶
- When it is used: This scheduling occurs in systems using the many-to-one and many-to-many multithreading models.
- Who does the scheduling: The user-level thread library (not the OS kernel) performs PCS scheduling.
- What it does: The thread library decides which of its user-level threads gets to run on an available LWP.
- The Competition Pool: Competition for the CPU happens only among threads belonging to the same process. Threads from one process do not compete with threads from another process at this level.
Important Clarification: When a thread library schedules a user thread onto an LWP, that thread is not yet running on a CPU. It simply now has a "vehicle" (the LWP and its associated kernel thread) that can be scheduled by the OS. The OS must then schedule that kernel thread onto a physical CPU core using SCS.
How PCS Scheduling Works:
- It is typically priority-based. The thread library selects the runnable user-level thread with the highest priority.
- These priorities are usually set by the programmer and not adjusted by the library.
- PCS is usually preemptive: a higher-priority thread will preempt a lower-priority one.
- However, there is typically no time slicing (Round-Robin) among threads of equal priority within PCS. If a thread with equal priority is running, it may run indefinitely unless it voluntarily yields.
2. System-Contention Scope (SCS)¶
- When it is used: This scheduling is performed by the operating system kernel.
- What it does: The kernel decides which kernel-level thread (from any process in the system) gets to run on a physical CPU core.
- The Competition Pool: Competition for the CPU happens among all threads in the entire system. A thread from one process competes directly with threads from all other processes.
Systems that use the one-to-one threading model (like Windows and Linux) schedule threads using only SCS. In these systems, every user-level thread is mapped directly to a kernel thread, so the kernel has full visibility and control over all threads, making PCS scheduling by a thread library unnecessary.
5.4.2 Pthread Scheduling¶
Contention Scope in the Pthreads API¶
The POSIX Pthreads standard provides an API that allows a programmer to explicitly specify the contention scope for a thread when it is created. This gives the programmer control over whether a thread is scheduled via PCS or SCS.
The two defined contention scope values are:
- PTHREAD_SCOPE_PROCESS: Schedules the thread using Process-Contention Scope (PCS).
- PTHREAD_SCOPE_SYSTEM: Schedules the thread using System-Contention Scope (SCS).
How Contention Scope Affects Thread Mapping¶
The effect of choosing a scope depends on the underlying threading model supported by the operating system:
On systems using the many-to-many model:
- PTHREAD_SCOPE_PROCESS: The thread library schedules the user-level thread onto one of a pool of available Lightweight Processes (LWPs). The number of LWPs is managed by the thread library, potentially using a technique like scheduler activations (see Section 4.6.5).
- PTHREAD_SCOPE_SYSTEM: This policy creates and binds a dedicated LWP for the user-level thread. This effectively maps the thread using the one-to-one policy, making it directly visible to and scheduled by the OS kernel.
On systems using the one-to-one model (like Linux and macOS):
- The PTHREAD_SCOPE_SYSTEM policy is used by default and is often the only allowed value. Since every user thread is already mapped to a kernel thread, there is no separate PCS scheduling; all scheduling is done by the kernel at the system scope.
The Pthreads API Functions¶
Pthreads provides two main functions for managing the contention scope attribute of a thread:
pthread_attr_setscope(pthread_attr_t *attr, int scope)
- Purpose: Sets the contention scope in a thread attribute object.
- Parameters:
  - attr: A pointer to the thread attribute object.
  - scope: The desired scope, either PTHREAD_SCOPE_SYSTEM or PTHREAD_SCOPE_PROCESS.
- Return Value: Returns 0 on success, and a nonzero error code on failure.

pthread_attr_getscope(pthread_attr_t *attr, int *scope)
- Purpose: Retrieves the current contention scope from a thread attribute object.
- Parameters:
  - attr: A pointer to the thread attribute object.
  - scope: A pointer to an integer where the current scope value will be stored.
- Return Value: Returns 0 on success, and a nonzero error code on failure.
Code Example Breakdown¶
The provided program demonstrates how to use the Pthread scheduling API:
- Initialization: It first gets the default thread attributes using pthread_attr_init().
- Inquiry: It checks the current default scheduling scope using pthread_attr_getscope() and prints whether it is PTHREAD_SCOPE_PROCESS or PTHREAD_SCOPE_SYSTEM.
- Setting the Scope: It sets the scheduling scope in the attribute object to PTHREAD_SCOPE_SYSTEM using pthread_attr_setscope(). Any thread created with this attribute will be bound to its own LWP and scheduled directly by the OS kernel.
- Thread Creation: It creates five threads using pthread_create(), passing the modified attribute object (&attr) so that they are all created with the SCS policy.
- Waiting and Execution: The main thread waits for all created threads to finish using pthread_join(). Each thread begins execution in the runner() function.
Important Note: The text reminds us that on some systems, like Linux and macOS, only PTHREAD_SCOPE_SYSTEM is allowed. Attempting to set PTHREAD_SCOPE_PROCESS on these systems would result in an error.
5.5 Multi-Processor Scheduling¶
Introduction: From Single-Core to Multi-Core¶
Our discussion so far has assumed a system with a single CPU core. When multiple processing units are available, load sharing—running multiple threads in parallel—becomes possible. However, scheduling becomes significantly more complex, and there is no single "best" solution.
The definition of a "multiprocessor" has evolved. Traditionally, it meant multiple physical CPUs, each with a single core. On modern systems, it encompasses:
- Multicore CPUs: A single chip with multiple processing cores.
- Multithreaded cores: Cores that can execute multiple threads simultaneously (e.g., Intel's Hyper-Threading).
- NUMA systems: Non-Uniform Memory Access systems, where memory access times depend on the memory location relative to the processor.
- Heterogeneous multiprocessing: Systems where the processors are not identical (e.g., ARM's big.LITTLE, with high-power and low-power cores).
We will first focus on homogeneous systems, where all processors are functionally identical, so any process can run on any CPU. Later, we will cover heterogeneous systems.
5.5.1 Approaches to Multiple-Processor Scheduling¶
There are two primary architectural approaches to organizing the scheduler in a multiprocessor system.
1. Asymmetric Multiprocessing (ASMP)¶
- The Idea: All scheduling decisions, I/O processing, and other system activities are handled by a single, designated processor—the master server. The other processors only execute user code.
- Advantage: Simplicity. Only one core accesses the system data structures (like the ready queue), which drastically reduces the complexity of sharing data and the need for locking.
- Disadvantage: The master server becomes a performance bottleneck. If the master is overloaded, the entire system's performance suffers, as all other processors must wait for scheduling decisions.
2. Symmetric Multiprocessing (SMP) - The Standard Approach¶
- The Idea: Each processor is self-scheduling. Every core independently runs its own scheduler, which examines the ready queue and selects a thread to run.
- Prevalence: This is the standard approach used by virtually all modern operating systems (Windows, Linux, macOS, Android, iOS).
Within SMP, there are two main strategies for organizing the threads that are eligible to be scheduled:
(A) Common Ready Queue (Shared Queue)
- Description: All threads are placed in a single, shared ready queue that all processors access.
- Challenge: Race Conditions. We must ensure that two different processors do not select the same thread to run, and that threads are not lost from the queue. This requires locking mechanisms (covered in Chapter 6).
- Problem: The lock protecting the shared queue can become a high-contention bottleneck, as every processor must acquire it to select a thread, slowing down the entire scheduling process.
(B) Per-Processor Run Queues (Private Queues)
- Description: Each processor has its own private queue of threads. A processor only schedules threads from its own queue.
- Advantage: No contention. Since each processor manages its own queue, no complex locking is needed for thread selection, avoiding the bottleneck of a shared queue.
- Additional Benefit: This can lead to more efficient use of cache memory (as discussed in Section 5.5.4), because a thread is more likely to be re-scheduled on the same processor, reusing data that may still be in that processor's cache.
- Challenge: Load Imbalance. One processor's queue might be empty while another's is full. To solve this, load balancing algorithms are used to migrate threads between queues to equalize the workload.
Conclusion: Due to the performance benefits, SMP with per-processor run queues is the most common approach in modern operating systems.
5.5.2 Multicore Processors¶
From Multiple CPUs to Multiple Cores¶
Traditionally, SMP systems used multiple physical processor chips to achieve parallelism. Now, the standard is the multicore processor: a single physical chip that contains multiple independent computing cores. Each core maintains its own architectural state (registers, etc.), so the operating system sees each core as a separate logical CPU. This is faster and more power-efficient than using multiple single-core chips.
The Problem: Memory Stalls¶
Multicore processors introduce a scheduling complication related to memory speed.
- The Memory Stall: Processors run much faster than memory. When a core needs data that isn't in its local cache (a cache miss), it must wait for the data to be fetched from main memory. This waiting period is called a memory stall. A processor can spend up to 50% of its time stalled, waiting for data.
(You should refer to Figure 5.12 in your text, which illustrates a compute cycle followed by a memory stall cycle.)
The Hardware Solution: Multithreaded Cores¶
To keep cores busy during memory stalls, hardware designers created multithreaded cores. In this design, two or more hardware threads are assigned to a single core. Each hardware thread has its own architectural state (instruction pointer, register set), so it appears to the OS as a separate logical CPU.
- How it works: If one hardware thread stalls waiting for memory, the core can immediately switch to executing another hardware thread that is ready to run. This technique is known as Chip Multithreading (CMT) or, in Intel's terminology, Hyper-Threading or Simultaneous Multithreading (SMT).
(You should refer to Figure 5.13 in your text, which shows how the execution of two threads is interleaved on a single core to hide memory stalls.)
Example: A processor with 4 cores, each supporting 2 hardware threads, provides the operating system with 8 logical CPUs to schedule.
(You should refer to Figure 5.14 in your text, which illustrates this mapping from physical cores and hardware threads to the OS's view of logical CPUs.)
Types of Multithreading¶
There are two primary ways to implement hardware multithreading:
Coarse-Grained Multithreading:
- A thread executes on a core until a long-latency event (like a memory stall) occurs.
- The core then switches to another thread.
- Disadvantage: The cost of switching is high because the core's instruction pipeline must be flushed, causing a significant delay.
Fine-Grained (Interleaved) Multithreading:
- Switching between threads happens at a much finer granularity, often at the boundary of an instruction cycle.
- The core has built-in logic to handle this frequent switching, so the cost of switching between threads is very low.
The Two Levels of Scheduling¶
Multithreaded cores create a hierarchy that requires two different levels of scheduling.
(You should refer to Figure 5.15 in your text, which illustrates these two levels.)
Level 1: Operating System Scheduling (Software Threads -> Hardware Threads)
- This is the scheduling we have been discussing all chapter.
- The OS scheduler decides which software thread to assign to each hardware thread (logical CPU).
- The OS can use any algorithm (RR, Priority, etc.) for this decision.
Level 2: Core Scheduling (Hardware Threads -> Physical Core)
- This is done by the processor core itself.
- The core must decide which of its assigned hardware threads gets to use the core's execution resources at any given moment.
- Strategies vary by processor:
- Simple Round-Robin: Used by processors like the UltraSPARC T3.
- Priority-Based (Urgency): Used by the Intel Itanium. Each hardware thread is assigned a dynamic "urgency" value (0-7). When a triggering event occurs (e.g., a cache miss), the core compares the urgency of its threads and runs the one with the highest priority.
The Importance of OS Awareness¶
The two scheduling levels are not independent. If the OS scheduler is aware of the underlying core architecture, it can make much smarter decisions.
- Example: A CPU has 2 cores, and each core has 2 hardware threads.
- If the OS schedules two busy software threads onto two hardware threads that share the same physical core, they will have to compete for that core's resources (caches, pipelines) and will run more slowly.
- If the OS is "core-aware," it can schedule those two threads onto hardware threads that are on separate physical cores. This avoids resource contention and allows the threads to run in true parallel, leading to better performance.
This concept is known as processor affinity, which we will explore in the next section.
5.5.3 Load Balancing¶
What is Load Balancing and Why is it Needed?¶
In a Symmetric Multi-Processing (SMP) system, where multiple CPUs share the same memory, the goal is to use all processors efficiently. Load balancing is the technique used to distribute work evenly across all these processors.
Why is it necessary? Imagine each CPU has its own private queue of threads (a private "to-do list"). Without load balancing, one CPU might be overwhelmed with a long queue of threads, while another CPU sits idle with an empty queue. This defeats the purpose of having multiple processors. Load balancing ensures that no processor is idle while others are overloaded.
Important Note: Load balancing is only needed in systems where each processor has a private ready queue. If all processors share a single, common run queue (a single shared "to-do list"), load balancing isn't an issue. An idle processor can immediately just pick the next thread from that common queue.
Two Approaches to Load Balancing¶
There are two primary strategies for moving threads between processors to balance the load. They are often used together.
1. Push Migration
- How it works: A dedicated master process or subsystem (like a part of the kernel) periodically checks the load on every processor. If it finds that one processor has significantly more threads in its queue than others, it actively pushes threads from the overloaded processor's queue to the queues of idle or less-busy processors.
- Analogy: Think of a manager (the "pusher") who watches all workers (CPUs). If the manager sees one worker has too many tasks, they take some of those tasks and redistribute them to workers with lighter loads.
2. Pull Migration
- How it works: This approach is initiated by an idle processor. When a processor has nothing to do, it doesn't sit idle; instead, it goes looking for work. It pulls a waiting thread from the queue of a processor that is busy.
- Analogy: Think of a worker (CPU) who finishes their assigned tasks. Instead of taking a break, they go to their busy colleagues and ask, "Can I take one of your tasks to help out?"
These two methods are not mutually exclusive. Most modern operating systems use a combination of both. For example, as the text notes, the Linux CFS scheduler and the FreeBSD ULE scheduler implement both push and pull migration.
What Does a "Balanced Load" Actually Mean?¶
The concept of a "balanced load" is more nuanced than it first appears. It's not always as simple as having the same number of threads in every queue. Here are different ways to define balance:
- Balance by Thread Count: The simplest view. A system is balanced if every processor's ready queue has roughly the same number of threads waiting.
- Balance by Priority: A more sophisticated view. A system is balanced if the priority of the threads is evenly distributed. For instance, you wouldn't want all the high-priority, critical threads on one CPU and all the low-priority background threads on another.
The text also hints that sometimes, these balancing strategies can actually work against the scheduler's goals. For example, constantly moving a thread from one CPU to another can cause cache misses, because the new CPU's cache doesn't contain the data the thread was just working on. This can hurt performance, which is the opposite of what scheduling aims to achieve.
5.5.4 Processor Affinity¶
The Problem: The Cost of Migrating a Thread¶
To understand processor affinity, we need to recall the role of the CPU cache from computer architecture. When a thread runs on a specific processor, the data it accesses gets loaded into that processor's fast cache memory. This creates a "warm cache" where the thread's subsequent memory accesses are very fast because the data is already nearby.
Now, consider what happens if the operating system's load balancer moves this thread to a different processor. The consequences are:
- The first processor's cache, which is now "warm" with the thread's data, is invalidated for that thread.
- The second processor's cache is "cold" for this thread. It must now slowly repopulate its cache by fetching the thread's data from main memory.
This process of invalidating and repopulating caches is very expensive and hurts performance. Because of this high cost, most SMP operating systems try to avoid migrating threads and instead try to keep a thread running on the same processor. This preference is called processor affinity—a thread develops an "affinity" for the processor it has been running on.
How Scheduling Queues Affect Affinity¶
The way the system's ready queue is organized (from Section 5.5.1) has a direct impact on processor affinity:
- Common Ready Queue: Any idle processor can pick any thread from the shared queue. This makes it very likely that a thread will be scheduled on a different processor each time, breaking affinity and causing cache misses.
- Per-Processor Ready Queues (Private Queues): A thread placed on a specific processor's queue will always be scheduled on that same processor. This design provides processor affinity for free, as the thread will naturally benefit from a warm cache.
Soft vs. Hard Affinity¶
Processor affinity comes in two main forms:
1. Soft Affinity
- This is the operating system's policy to try to keep a thread on the same processor, but it makes no guarantee.
- It is a "best effort" approach. The scheduler will try to maintain affinity, but if load balancing demands it, the thread can and will be migrated to another processor.
- This is the most common default behavior in many operating systems.
2. Hard Affinity
- This allows a process or thread to specify a subset of processors on which it is allowed to run.
- The operating system must respect this restriction and cannot migrate the thread outside of this specified set. This is typically done through a system call.
- Example: Linux implements soft affinity by default but also provides the sched_setaffinity() system call, which allows a programmer to set hard affinity for a thread.
The NUMA Factor¶
The concept of affinity becomes even more critical on systems with a Non-Uniform Memory Access (NUMA) architecture.
- (You should refer to Figure 5.16 in your text, which shows a NUMA system with two physical processor chips, each with its own CPU and its own local, physical memory.)
- While all CPUs share a single address space (they can all access all memory), access times are not uniform. A CPU can access its local memory very quickly, but accessing memory that is "local" to another CPU is much slower because it has to travel over a system interconnect.
This has a major scheduling implication: If a thread is scheduled on CPU 1 but the data it needs is in the memory local to CPU 2, it will suffer from slow memory access times.
Therefore, on a NUMA-aware system, the scheduler and memory manager work together. The goal is to:
- Schedule a thread on a specific CPU.
- Allocate the memory for that thread from the pool of memory that is local to that CPU.
This pairing of a thread with its "local" CPU and memory provides the fastest possible memory access.
The Central Conflict: Load Balancing vs. Affinity/NUMA¶
The text highlights a fundamental tension in multiprocessor scheduling:
- The Goal of Processor Affinity & NUMA-Awareness: Keep a thread on one processor to maximize cache warmth and minimize memory access latency.
- The Goal of Load Balancing: Move threads away from their current processor to evenly distribute work.
These two goals directly oppose each other. Load balancing, by moving threads, actively destroys the performance benefits gained from a warm cache and optimal NUMA memory placement.
This conflict is why scheduling algorithms for modern multicore and NUMA systems are extremely complex. They must intelligently balance the need for keeping all CPUs busy (load balancing) against the need for keeping individual thread performance high (by respecting affinity and NUMA locality). The Linux CFS scheduler, which we will explore later, is a prime example of an algorithm that deals with this complex trade-off.
5.5.5 Heterogeneous Multiprocessing¶
Moving Beyond Identical Cores¶
Up until now, we've assumed that in a multiprocessor system, all CPUs or cores are identical (homogeneous). This means any thread can run on any core with the same expected performance, and the main scheduling concerns were load balancing and cache affinity.
Heterogeneous Multiprocessing (HMP) changes this assumption. In an HMP system, the cores are not all identical. They run the same instruction set (so they can execute the same programs), but they differ significantly in:
- Clock Speed (performance)
- Power Consumption (energy efficiency)
- Power Management Capabilities (e.g., the ability to be idled or put into low-power states)
Important Distinction: This is not the same as Asymmetric Multiprocessing (AMP) from Section 5.5.1. In AMP, only certain processors can run the kernel or handle I/O. In HMP, any task (system or user) can run on any core. The difference is that the cores themselves have different performance and power characteristics.
The Goal: Intelligent Power and Performance Management¶
The primary intention behind HMP is to manage power consumption intelligently, which is crucial for mobile and battery-powered devices. The scheduler's job becomes more complex: it must assign tasks to specific cores based not just on readiness, but on the task's performance demands and the core's power profile.
The big.LITTLE Architecture¶
A prominent example of HMP, used in ARM processors, is the big.LITTLE architecture. This design pairs two types of cores:
"big" Cores: These are high-performance cores. They can execute tasks very quickly.
- Trade-off: This high performance comes at the cost of high energy consumption. Because they use so much power, they should only be used for short, intensive bursts to avoid draining the battery and generating excess heat.
"LITTLE" Cores: These are highly energy-efficient cores. They are slower than the "big" cores.
- Trade-off: They sacrifice peak performance for very low power consumption, allowing them to run for long periods without significant battery drain.
Scheduling Strategy in an HMP System¶
The CPU scheduler in an HMP system uses a strategy that maps tasks to the most appropriate core type:
Assigning to "LITTLE" Cores: Tasks that do not require high performance but may run for a long time are assigned to the energy-efficient LITTLE cores. Examples include background synchronization, playing music, or monitoring sensors. This preserves battery life.
Assigning to "big" Cores: Tasks that require high processing power and are interactive are assigned to the high-performance big cores. Examples include launching an app, rendering a complex web page, or playing a game. These tasks get the performance they need for a responsive user experience.
Power-Saving Mode: The system can dynamically disable the energy-intensive "big" cores entirely. When the device is in a power-saving mode, the system can rely solely on the "LITTLE" cores to extend battery life as much as possible.
Operating System Support: Modern OSes like Windows 10 support HMP scheduling. They often provide interfaces that allow a thread to specify a scheduling policy that aligns with its power management needs, giving the scheduler a hint about where it should best be placed.
5.6 Real-Time CPU Scheduling¶
Introduction to Real-Time Systems¶
Real-time operating systems have requirements that are fundamentally different from the general-purpose systems (like your laptop or phone) we've discussed so far. For a real-time system, correctness depends not only on the logical result of the computation but also on the time at which the results are produced.
Missing a timing deadline can lead to a system failure, the severity of which defines the two main categories of real-time systems.
Soft Real-Time Systems¶
- Definition: A soft real-time system provides no strict guarantee about when a critical real-time task will be scheduled and completed.
- Guarantee: The only guarantee offered is that critical real-time processes will be given preference over non-critical, ordinary processes. They will be scheduled first, whenever possible.
- Consequence of Missing a Deadline: In a soft real-time system, missing a deadline is undesirable and degrades the quality of service, but it is not considered a catastrophic failure of the entire system.
- Example: Streaming video or audio. If a frame is late, you might experience a glitch or a skip, but the video keeps playing. The system remains functional.
Hard Real-Time Systems¶
- Definition: A hard real-time system has absolute and strict timing requirements. A task must be serviced within its specified deadline.
- The Critical Rule: Service after the deadline has expired is equivalent to no service at all. In fact, a late result can often be worse than no result, as it may be incorrect or cause the system to take a wrong action.
- Consequence of Missing a Deadline: Missing a deadline is considered a complete and utter system failure.
- Example: The anti-lock braking system (ABS) in a car. If a sensor detects a wheel locking up, the control algorithm must compute and execute the corrective action (pulsing the brake) within a few milliseconds. If this computation misses its deadline, the wheel remains locked, and the car may skid, leading to an accident. The system has failed in its primary function.
This section will explore the specific scheduling algorithms and challenges involved in meeting these stringent timing demands for both soft and hard real-time systems.
5.6.1 Minimizing Latency¶
The Core Concept: Event Latency¶
Real-time systems are fundamentally event-driven. They sit idle, waiting for specific events to occur. These events can be:
- Software Events: Such as a timer expiring.
- Hardware Events: Such as a sensor detecting an obstacle.
When an event occurs, the system must respond to it and perform the required service as quickly as possible. The time delay between the event occurring and it being serviced is called Event Latency.
- Go to Figure 5.17: This figure illustrates event latency. At time `t0`, the event E first occurs. At time `t1`, the real-time system finally responds to it. The period between `t0` and `t1` is the event latency.
Different systems have vastly different latency requirements. An anti-lock brake system must respond within 3-5 milliseconds, while an aircraft's radar system might tolerate a latency of several seconds.
The Two Critical Types of Latency¶
The total event latency is composed of two main parts that the operating system must minimize.
1. Interrupt Latency
Definition: The period of time from when an interrupt signal arrives at the CPU to when the specific Interrupt Service Routine (ISR) that handles it actually begins executing.
What happens during this time? The OS cannot instantly respond to the interrupt. It must:
- Finish executing the current CPU instruction.
- Determine the type of interrupt that occurred.
- Save the state of the current process (so it can be resumed later) before jumping to the ISR.
The total time to perform these tasks is the interrupt latency.
Go to Figure 5.18: This figure shows a task T running. An interrupt occurs, but the ISR does not start immediately. The period between the interrupt and the start of the ISR is the interrupt latency, which includes the time to determine the interrupt type and perform a context switch.
- Minimization: For real-time systems, especially hard real-time systems, it is not enough to just make this latency small on average; it must have a known, strict upper bound (bounded). A key factor that increases interrupt latency is if the OS disables interrupts for too long while it is updating critical kernel data structures. Real-time OSes keep these periods extremely short.
2. Dispatch Latency
Definition: The amount of time required for the scheduler's dispatcher to stop one process and start another one running.
The Goal: To get the time-critical, high-priority process onto the CPU as fast as possible once it is ready. The most important technique for achieving low dispatch latency is using a preemptive kernel, which allows a higher-priority process to interrupt a lower-priority one even if it is executing a kernel call. In hard real-time systems, dispatch latency is often measured in microseconds.
Go to Figure 5.19: This figure breaks down the response interval (the total time from event to response) and highlights the components of dispatch latency within it.
The Components of Dispatch Latency: The Conflict Phase¶
Dispatch latency is not a single action. Its most complex part is the conflict phase, which consists of two components:
Preemption of any process running in the kernel: The dispatcher must first preempt (stop) whatever process is currently running on the CPU, even if that process is executing a system call in the kernel.
Release by low-priority processes of resources needed by a high-priority process: This is a more serious issue. If a low-priority process holds a resource (like a lock on a shared data structure) that the high-priority real-time task needs, the high-priority task must wait for the low-priority task to release that resource. This situation, where a higher-priority task is effectively blocked by a lower-priority one, is known as priority inversion and is a major problem in real-time systems.
Once the conflict phase is resolved, the actual dispatch phase (switching the context to the new process) is relatively fast. The primary challenge is minimizing the delays caused by the conflict phase.
5.6.2 Priority-Based Scheduling¶
The Foundation of Real-Time Scheduling¶
The most critical requirement for a real-time operating system is to respond immediately to a real-time process the moment it needs the CPU. To achieve this, the scheduler must be built on two pillars:
- Priority-Based Scheduling
- Preemption
This means each process is assigned a priority, with more important (real-time) tasks getting higher priorities. Furthermore, if a higher-priority process becomes ready to run, it will immediately preempt (kick off) whatever lower-priority process is currently using the CPU.
Soft Real-Time in Practice¶
As discussed in Section 5.3.4, preemptive, priority-based scheduling is the standard mechanism used. Major operating systems implement this for soft real-time support:
- Windows: Has 32 priority levels. Levels 16 to 31 are specifically reserved for real-time processes.
- Solaris and Linux: Use similar schemes, where real-time processes are assigned the highest possible scheduler priorities.
Crucial Distinction: It is vital to understand that providing a preemptive, priority-based scheduler only guarantees soft real-time functionality. It gives real-time tasks preference but makes no absolute promise that they will always meet their deadlines. For hard real-time systems, where missing a deadline is a system failure, we need more sophisticated schedulers that can make and keep such guarantees.
Characterizing Processes for Hard Real-Time Scheduling¶
To build a scheduler for hard real-time systems, we first need to define the processes more formally. These processes are typically periodic, meaning they require the CPU at constant, predictable intervals. Each periodic process is characterized by three parameters:
- Processing Time (`t`): The fixed worst-case execution time the process needs on the CPU during each period.
- Deadline (`d`): The time by which the process must be serviced after it becomes ready.
- Period (`p`): The fixed time interval between consecutive arrivals of the process.
The relationship between these parameters is: 0 ≤ t ≤ d ≤ p
- Go to Figure 5.20: This figure illustrates a periodic task. You can see the task arriving and executing at regular intervals (period1, period2, period3). The processing time `t` is the shaded portion where it runs, and it always finishes within its deadline `d` for that period. The rate of the task (how often it occurs) is `1/p`.
The Admission Control Paradigm¶
Hard real-time scheduling introduces a unique concept not found in general-purpose scheduling: Admission Control.
Here is how it works:
- A process must announce its real-time requirements to the scheduler. It essentially says, "I am a periodic task with period `p`, deadline `d`, and required computation time `t`."
- The scheduler then runs an admission-control algorithm. This algorithm performs a feasibility test.
- The scheduler then makes a definitive decision:
- ADMIT: If the scheduler can guarantee that the new task can be scheduled without causing itself or any existing task to miss a deadline, it admits the process and makes that guarantee.
- REJECT: If the scheduler determines that it cannot make this guarantee, it rejects the process's request as impossible to satisfy. The task may not be run at all.
This "admit or reject" strategy is fundamental to hard real-time systems. It allows the system to know with certainty that all admitted tasks will meet their deadlines, which is the core requirement of a hard real-time environment.
5.6.3 Rate-Monotonic Scheduling¶
The Algorithm: Priority by Period¶
Rate-Monotonic Scheduling (RMS) is a classic algorithm for scheduling periodic tasks in hard real-time systems. It operates as a static priority preemptive scheduler.
- Static Priority: A task's priority is fixed and assigned when it is created. It does not change during execution.
- Preemptive: A higher-priority task can always take the CPU away from a lower-priority task.
The core rule of RMS is simple: Assign a priority inversely based on the task's period.
- Shorter Period = Higher Priority
- Longer Period = Lower Priority
Why this rule? The rationale is that a task with a short period needs the CPU more frequently. To ensure it can meet all its frequent deadlines, it is given a higher priority. RMS also makes an important assumption: the processing time (CPU burst) of every periodic task is the same each time it runs.
An Example: Scheduling Two Tasks¶
Let's define two periodic processes:
- Process P1: Period `p1 = 50`, Processing Time `t1 = 20`
- Process P2: Period `p2 = 100`, Processing Time `t2 = 35`
The deadline for each task is the start of its next period. First, we check the total CPU load:
- CPU utilization of P1 = `t1/p1` = 20/50 = 0.40 (40%)
- CPU utilization of P2 = `t2/p2` = 35/100 = 0.35 (35%)
- Total CPU Utilization = 75%
Since the total utilization is less than 100%, it seems theoretically possible to schedule both tasks.
Scenario 1: Incorrect Priority Assignment¶
Let's see what happens if we incorrectly assign P2 a higher priority than P1.
- Go to Figure 5.21: This figure shows the schedule under this policy.
- Time 0-35: P2 (higher priority) runs first and completes its full 35 ms burst.
- Time 35-55: P1 (lower priority) finally starts. It runs for 20 ms, finishing at time 55.
- The Problem: P1's first deadline was at time 50 (the start of its next period). Since it finished at time 55, it has missed its deadline. This shows that even with low total CPU usage, a wrong scheduling policy can cause failures.
Scenario 2: Rate-Monotonic Scheduling¶
Now, let's apply the RMS rule. Since P1 has a shorter period (50 vs. 100), it is assigned a higher priority than P2.
- Go to Figure 5.22: This figure shows the correct RMS schedule.
- Time 0-20: P1 (higher priority) runs first and completes its 20 ms burst well before its deadline of 50.
- Time 20-50: P2 (lower priority) begins running. It runs for 30 ms.
- Time 50: This is a critical moment. P1 becomes ready again at the start of its new period. Because P1 has a higher priority, it preempts P2, even though P2 has only used 30 of its 35 ms of CPU time.
- Time 50-70: P1 runs its second CPU burst, finishing at time 70.
- Time 70-75: P2 resumes and completes its remaining 5 ms of computation.
- Checking Deadlines:
- P1 finished its first burst at time 20 (deadline was 50) and its second burst at time 70 (deadline was 100). All deadlines met.
- P2 finished its first burst at time 75 (deadline was 100). Deadline met.
By following the rate-monotonic rule (shorter period = higher priority), both tasks successfully meet their first set of deadlines, which the incorrect priority assignment failed to do.
The Limits of Optimality¶
The previous example showed Rate-Monotonic Scheduling (RMS) working correctly. It is considered optimal among all static-priority schedulers. This means that if a set of periodic tasks cannot be scheduled using RMS (where priority is the inverse of the period), then that set of tasks cannot be scheduled using any other static-priority assignment policy.
However, "optimal" does not mean "able to schedule everything." RMS has fundamental limits.
An Example Where RMS Fails¶
Let's examine a set of two processes that RMS cannot successfully schedule:
- Process P1: Period `p1 = 50`, Processing Time `t1 = 25`
- Process P2: Period `p2 = 80`, Processing Time `t2 = 35`
Following the RMS rule, P1 has a shorter period and thus gets a higher priority than P2.
First, check the CPU utilization:
- P1 utilization = 25/50 = 0.50 (50%)
- P2 utilization = 35/80 = 0.4375 (43.75%)
- Total CPU Utilization = 0.9375 (93.75%)
Since utilization is less than 100%, it seems possible. However, let's see the actual schedule.
- Go to Figure 5.23: This figure shows the timeline where deadlines are missed.
- Time 0-25: P1 (high priority) runs and completes its full 25 ms burst.
- Time 25-50: P2 (low priority) starts and runs for 25 ms. It has used 25 of its 35 ms needed.
- Time 50: P1 becomes ready again at the start of its next period. It preempts P2.
- Time 50-75: P1 runs its next full 25 ms burst.
- Time 75-85: P2 resumes and finishes its remaining 10 ms.
- The Problem: P2's first deadline was at time 80 (the start of its next period). However, P2 does not finish its required computation until time 85. Therefore, P2 has missed its deadline.
This demonstrates a critical point: even with a total CPU utilization less than 100% (here, 93.75%), a schedule that meets all deadlines may not exist with a static-priority scheme.
The CPU Utilization Bound¶
The previous example leads to the most important limitation of RMS: there is a mathematical upper bound on the total CPU utilization for which RMS can guarantee all deadlines.
This worst-case utilization bound for scheduling N processes is: N(2^(1/N) - 1)
Let's calculate this bound for different values of N:
- For N = 1 process: 1 * (2^(1/1) - 1) = 1 * (2 - 1) = 1.00 (100%)
- For N = 2 processes: 2 * (2^(1/2) - 1) = 2 * (1.414 - 1) ≈ 0.828 (82.8%)
- For N = 3 processes: 3 * (2^(1/3) - 1) ≈ 3 * (1.26 - 1) ≈ 0.78 (78%)
- As N approaches infinity, the bound approaches ln(2) ≈ 0.693 (69.3%)
What this means in practice: If the total CPU utilization of your set of tasks is below this bound, RMS is guaranteed to be able to schedule them successfully.
- In Figure 5.21/5.22, the utilization was 75%, which is below the 82.8% bound for two tasks. RMS worked.
- In Figure 5.23, the utilization was ~94%, which is above the 82.8% bound for two tasks. RMS failed, and any other static-priority scheduler would also fail.
This utilization bound is why admission control in hard real-time systems is not just about checking for 100% utilization; it must check against this stricter, lower bound to provide a guarantee.
5.6.4 Earliest-Deadline-First Scheduling¶
The Algorithm: Dynamic Priority by Deadline¶
Earliest-Deadline-First (EDF) scheduling is a dynamic priority algorithm used for hard real-time systems. Its core principle is:
- The earlier the deadline, the higher the priority.
- The later the deadline, the lower the priority.
This is a fundamental shift from Rate-Monotonic Scheduling (RMS).
- In RMS, priorities are static (fixed based on the period at the start).
- In EDF, priorities are dynamic and can change whenever a new process becomes runnable.
To make this work, a process must announce its deadline requirement to the scheduler as soon as it becomes ready to run. The scheduler then adjusts the priorities of all runnable tasks to reflect the new deadlines.
EDF in Action: Solving the RMS Problem¶
Let's apply EDF to the same problem that caused RMS to fail in Figure 5.23. The processes are:
- Process P1: Period `p1 = 50`, Processing Time `t1 = 25` (Deadlines at 50, 100, 150, ...)
- Process P2: Period `p2 = 80`, Processing Time `t2 = 35` (Deadlines at 80, 160, ...)
The total CPU utilization is still 93.75%.
- Go to Figure 5.24: This figure shows the EDF schedule, which successfully meets all deadlines.
- Time 0-25: At time 0, both P1 and P2 are ready. P1's first deadline is 50, and P2's first deadline is 80. Since 50 is earlier than 80, P1 has a higher priority and runs first, finishing at time 25.
- Time 25-50: Only P2 is runnable, so it runs.
- Time 50: The Critical EDF Decision Point. P1 becomes ready again for its second burst (with a new deadline of 100). We must now compare the deadlines of the currently runnable tasks:
- P2 (which has run from time 25-50) still has a deadline of 80.
- P1 (which just arrived) has a new deadline of 100.
- Since 80 is earlier than 100, P2 now has a higher priority than P1. Therefore, P1 does NOT preempt P2. This is the key difference from RMS.
- Time 50-60: P2 continues running, completing its burst at time 60 and meeting its first deadline of 80.
- Time 60-85: P1 runs its second burst, finishing at time 85, well before its deadline of 100. Note that P2's second period begins at time 80 (new deadline 160) while P1 is still running, but P1's deadline of 100 is earlier, so P1 keeps the CPU.
- Time 85-100: P2 begins its second burst.
- Time 100: Another Critical Decision Point. P1 becomes ready for its third burst (new deadline 150). Since P1's deadline (150) is earlier than P2's (160), P1 preempts P2.
- Time 100-125: P1 runs its third burst, finishing at time 125.
- Time 125-145: P2 resumes and completes its second burst at time 145, meeting its deadline of 160.
As the timeline shows, EDF successfully navigates the high-utilization scenario where RMS failed.
Advantages and Theoretical Optimality of EDF¶
EDF has several significant advantages over RMS:
Greater Flexibility: It does not require processes to be periodic, nor does it require that a process uses the same amount of CPU time in every burst. The only requirement is that a process announces its deadline whenever it becomes runnable.
Theoretical Optimality: EDF is theoretically optimal among all scheduling algorithms on a single processor. This means that if a set of tasks is schedulable (i.e., it is possible to meet all deadlines), then EDF is capable of producing a valid schedule.
High Utilization Bound: Because it is optimal, EDF can theoretically achieve 100% CPU utilization while still meeting all deadlines. The utilization-based test for EDF is simple: if the total CPU utilization of all tasks is ≤ 100%, then the task set is schedulable using EDF.
Practical Limitation: In practice, 100% utilization is impossible to achieve because of the overhead costs of context switching between processes and handling interrupts. However, the utilization bound for EDF is still significantly higher than the ~69% bound for RMS as the number of tasks grows large, making EDF a much more powerful algorithm for hard real-time systems.
5.6.5 Proportional Share Scheduling¶
The Core Concept: Dividing Time with Shares¶
Proportional Share Scheduling is a different approach to real-time scheduling. Instead of being driven by deadlines (like EDF) or fixed periods (like RMS), it focuses on guaranteeing a specific fraction or percentage of the CPU time to each process or application.
The system works by defining a total number of T shares (or tickets) that represent the entire processing resource of the CPU. Each application is then allocated a certain number of these shares, N. This allocation guarantees that the application will receive at least N/T of the total processor time.
A Detailed Example¶
Let's walk through the example from the text:
- The total number of shares in the system is T = 100.
- We have three processes with the following share allocations:
- Process A: 50 shares
- Process B: 15 shares
- Process C: 20 shares
From these allocations, we can calculate the guaranteed proportion of CPU time for each process:
- Process A: 50 / 100 = 50% of the CPU time.
- Process B: 15 / 100 = 15% of the CPU time.
- Process C: 20 / 100 = 20% of the CPU time.
The sum of the allocated shares is 50 + 15 + 20 = 85 shares. This means 15 shares are still unallocated and represent free, unreserved CPU capacity.
The Role of Admission Control¶
A Proportional Share scheduler must be used with a strict admission-control policy to work correctly. The guarantee that a process gets its N/T of the CPU time is only possible if the system does not oversell the total available shares.
The job of the admission controller is to admit a new process only if there are enough free shares to satisfy its request.
- In our example: A new Process D requests 30 shares.
- Admission Control Check: The system has only 100 - 85 = 15 free shares available.
- Decision: Since 30 > 15, the admission controller denies the request from Process D. It is not allowed to run under the proportional share scheduler because granting its request would make it impossible to honor the existing guarantees given to Processes A, B, and C.
This "admit or reject" policy is essential for the scheduler to provide mathematically sound performance guarantees, similar to the admission control used in hard deadline scheduling.
5.6.6 POSIX Real-Time Scheduling¶
Introduction to the POSIX.1b Standard¶
To allow for real-time computing in a standardized way, the POSIX standard includes an extension known as POSIX.1b. This defines a set of Application Programming Interfaces (APIs) that allow a programmer to control how threads are scheduled. We will focus on the APIs related to scheduling classes.
The POSIX Real-Time Scheduling Classes¶
POSIX defines two primary scheduling classes designed for real-time threads:
`SCHED_FIFO` (First-In, First-Out)
- This policy schedules threads based on a first-come, first-served policy within a given priority level, using a FIFO queue as described in Section 5.3.1.
- Crucial Detail: There is no time slicing (round-robin) among threads of the same priority.
- Consequence: The highest-priority real-time thread at the front of its queue will be granted the CPU and will run until it:
- Voluntarily terminates.
- Blocks (e.g., for I/O or on a lock).
- Is preempted by a higher-priority thread that becomes ready.
`SCHED_RR` (Round Robin)
- This policy is similar to `SCHED_FIFO` in that it uses priority-based queues.
- Key Difference: It does provide time slicing among threads of equal priority.
- Consequence: If multiple threads with the same priority are ready, they are granted the CPU in turn, each for a specific time quantum. This prevents a single compute-bound thread from monopolizing the CPU at the expense of other same-priority threads.
`SCHED_OTHER`
- This is an additional scheduling class whose implementation is left undefined and system-specific by the POSIX standard.
- It represents the default, non-real-time scheduling policy (in Linux, this is the standard CFS scheduler). Its behavior can vary across different operating systems.
The POSIX API for Scheduling Policy¶
POSIX specifies functions to get and set the scheduling policy for a thread. These functions manipulate the thread's attribute object, which is used when creating a thread.
The Functions:
`pthread_attr_getschedpolicy(pthread_attr_t *attr, int *policy)`
- Purpose: This function gets the current scheduling policy from a thread attribute object.
- Parameters:
  - `*attr`: A pointer to the thread attribute structure.
  - `*policy`: A pointer to an integer where the current policy (`SCHED_FIFO`, `SCHED_RR`, etc.) will be stored.
- Return Value: Returns a nonzero value if an error occurs.
`pthread_attr_setschedpolicy(pthread_attr_t *attr, int policy)`
- Purpose: This function sets the scheduling policy in a thread attribute object.
- Parameters:
  - `*attr`: A pointer to the thread attribute structure.
  - `policy`: An integer value specifying the desired policy (`SCHED_FIFO`, `SCHED_RR`, etc.).
- Return Value: Returns a nonzero value if an error occurs.
How they are used: The text refers to Figure 5.25, which shows a sample program. The typical flow is:
- The program first uses `pthread_attr_getschedpolicy()` to check the current scheduling policy.
- It then uses `pthread_attr_setschedpolicy()` to change the policy to `SCHED_FIFO` (or another real-time policy).
- When a thread is created using this modified attribute object, it will run under the newly set scheduling policy.
5.7 Operating-System Examples¶
5.7.1 Example: Linux Scheduling¶
A History of Linux Schedulers¶
The Linux scheduler has evolved significantly over time to address changing hardware and workload demands.
1. The Pre-2.5 Era: Traditional UNIX Scheduler
- Early versions of Linux used a variation of the classic UNIX scheduling algorithm.
- Major Flaws:
- It was not designed with Symmetric Multi-Processing (SMP) in mind. It did not handle multiple processors effectively.
- Its performance degraded badly when the system had a large number of runnable processes.
2. The O(1) Scheduler (Kernel Version 2.5 and 2.6)
- This was a major overhaul to fix the problems of the old scheduler. Its name, "O(1)", describes its key feature: its operations (such as selecting a task to run) took constant time, regardless of how many tasks were in the system. This made it very efficient on large systems.
- Key Features it Introduced:
- SMP Support: It properly supported multiple processors, including:
- Processor Affinity: To keep tasks on the same CPU and benefit from warm caches.
- Load Balancing: To distribute work evenly across all CPUs.
- SMP Support: It properly supported multiple processors, including:
- The Problem: In practice, while the O(1) scheduler was excellent for server workloads and SMP performance, it often resulted in poor response times for interactive processes (like those on a desktop), leading to a less responsive user interface.
3. The Completely Fair Scheduler (CFS) (Kernel Version 2.6.23 and later)
- Due to the interactive performance issues of the O(1) scheduler, it was replaced by the Completely Fair Scheduler (CFS), which became the default scheduler starting with kernel release 2.6.23 and remains the default in modern Linux kernels.
- CFS was designed to provide superior fairness and responsiveness, especially for interactive tasks, while still performing well on SMP systems.
POSIX Real-Time Scheduling API Code Example¶
- Go to Figure 5.25: This C program demonstrates the use of the POSIX real-time scheduling API we discussed in Section 5.6.6.
What the code does, step-by-step:
- Initialization: It gets the default thread attributes using
pthread_attr_init(). - Get Policy: It uses
pthread_attr_getschedpolicy()to retrieve the current scheduling policy (e.g.,SCHED_OTHER,SCHED_RR,SCHED_FIFO) and prints it out. - Set Policy: It uses
pthread_attr_setschedpolicy()to change the scheduling policy for the new threads toSCHED_FIFO. - Create Threads: It creates five threads using
pthread_create(). These new threads will inherit theSCHED_FIFOpolicy from the attribute objectattr. - Wait for Threads: The main thread uses
pthread_join()to wait for all five threads to finish their execution. - Thread Function: Each of the five threads begins executing in the
runner()function, where they would perform their work.
The Scheduling Class Framework¶
Modern Linux scheduling is built on a modular framework of scheduling classes. Each class represents a different scheduling algorithm and is assigned a specific priority level.
- This design allows the kernel to use different scheduling algorithms for different types of tasks. For example, the needs of a Linux server differ from those of a mobile device, and this framework can accommodate that.
- The scheduler decides which task to run next with a simple rule: it looks at the highest-priority scheduling class that has runnable tasks and then selects the highest-priority task within that class.
- Standard kernels implement two main classes:
- The default class, which uses the Completely Fair Scheduler (CFS) algorithm.
- A real-time scheduling class.
- The framework is extensible, allowing new scheduling classes to be added.
The Completely Fair Scheduler (CFS) Algorithm¶
The CFS scheduler's goal is to be "fair," giving each task a fair share of the CPU. It does this in a unique way.
1. Proportion-Based Scheduling and Nice Values
- Instead of assigning fixed time slices, CFS assigns each task a proportion of the CPU processing time.
- This proportion is determined by the task's nice value. Nice values range from -20 to +19.
- A lower nice value means a higher relative priority.
- The default nice value is 0.
- Context: The term "nice" comes from the idea that a task that increases its nice value (e.g., from 0 to +10) is being "nice" to other tasks by lowering its own priority.
- CFS uses a value called targeted latency, which is the time interval during which every runnable task should get to run at least once. The CPU time proportion for each task is calculated from this targeted latency. This latency can increase if there are many active tasks to maintain efficiency.
2. The Core Mechanism: Virtual Run Time (vruntime)
- CFS does not directly use nice values to pick the next task. Instead, it maintains a key metric for each task called the virtual run time (vruntime).
vruntimemeasures how long a task has run, but it is weighted by the task's priority.- For a task with default priority (nice 0),
vruntimeincreases at the same rate as real, physical time. - For a higher-priority task (lower nice value, e.g., -10),
vruntimeincreases more slowly than real time. This makes itsvruntimevalue accumulate more slowly, so it gets to run more often. - For a lower-priority task (higher nice value, e.g., +10),
vruntimeincreases more quickly than real time. This makes itsvruntimeaccumulate faster, so it gets to run less often.
- For a task with default priority (nice 0),
- The Scheduling Decision is Simple: The scheduler always picks the runnable task with the smallest
vruntimevalue. This task is the "most deserving" of CPU time. Preemption also works this way: if a sleeping task wakes up and has a smallervruntimethan the currently running task, it will preempt it.
3. CFS in Action: I/O-bound vs. CPU-bound Consider an I/O-bound task (like a text editor) and a CPU-bound task (like a video encoder) with the same nice value.
- The I/O-bound task runs in short bursts and then blocks for I/O. While it's blocked, its
vruntimedoesn't increase. - The CPU-bound task runs for long periods, continuously increasing its
vruntime. - Result: After the I/O-bound task finishes waiting, it will have a much lower
vruntimethan the CPU-bound task. CFS will then immediately schedule the I/O-bound task, making the system feel responsive and interactive. This automatically gives I/O-bound tasks a priority boost without needing complex heuristics.
Real-Time Scheduling in Linux¶
Linux implements real-time scheduling according to the POSIX standard (SCHED_FIFO and SCHED_RR), as described in Section 5.6.6.
- Real-time tasks always have higher priority than normal CFS tasks.
- Linux uses two separate, unified priority ranges:
- Go to Figure 5.26: This figure illustrates the Linux priority scheme.
- Real-Time Priorities: Range from 0 (highest) to 99.
- Normal Priorities: Range from 100 to 139 (lowest).
- The nice values for normal tasks (-20 to +19) map directly onto the priority range 100 to 139. A nice value of -20 equals priority 100, and a nice value of +19 equals priority 139. This creates a single, global priority scale where a lower number always means a higher priority.
CFS Performance: The Red-Black Tree¶
The Data Structure for Efficiency¶
The Linux CFS scheduler needs a very efficient way to manage its list of runnable tasks and to quickly identify the one with the smallest vruntime. Instead of using a standard linear queue, it uses a sophisticated data structure called a red-black tree.
What is a Red-Black Tree?
- It is a self-balancing binary search tree.
- "Binary search tree" means each node has at most two children, and nodes are organized such that for any node, all keys in its left subtree are less than its key, and all keys in its right subtree are greater.
- "Self-balancing" means the tree automatically reorganizes itself during insertions and deletions to prevent it from becoming a long, inefficient chain. This guarantees that operations remain efficient.
How CFS Uses the Red-Black Tree¶
In the CFS scheduler:
- Each node in the tree represents a runnable task.
- The key used to organize the tree is the task's
vruntimevalue.
Organization of the Tree:
Tasks with smaller
vruntime(more deserving of CPU time) are placed toward the left side of the tree.Tasks with larger
vruntime(less deserving) are placed toward the right side.According to the properties of a binary search tree, the leftmost node in the entire tree is guaranteed to be the node with the smallest key.
Therefore, the leftmost node in the red-black tree is the task with the smallest
vruntimeand thus the highest priority for CFS.Go to the provided figure: The diagram shows this red-black tree. Task
T0is the leftmost node and has the smallestvruntime, making it the next task to be scheduled. As you move to the right (T1,T3,T7, etc.), thevruntimevalues get larger.
How Operations Work¶
Adding a Task: When a task becomes runnable (e.g., it's newly created or unblocks from I/O), it is inserted into the tree based on its
vruntime. The tree rebalances itself if needed. This is an O(log N) operation.Removing a Task: When a task stops being runnable (it blocks, terminates, or is moved to another CPU), it is removed from the tree, which then rebalances. This is also an O(log N) operation.
Selecting the Next Task: This is the most frequent and critical operation. In theory, finding the leftmost node in a balanced tree of N nodes is an O(log N) operation. However, CFS implements a crucial performance optimization: it caches a pointer to the leftmost node in a variable called
rb_leftmost.- This means that selecting the next task to run does not require searching the tree.
- The scheduler simply dereferences the
rb_leftmostpointer, which is an O(1) - constant time operation. - The cached value is updated whenever the tree is modified (during insertions and deletions).
This combination of the red-black tree for efficient management of a dynamic task list and the rb_leftmost cache for instant selection of the next task is what makes the CFS scheduler both scalable (it works well with a huge number of tasks) and efficient (it makes scheduling decisions very quickly).
CFS Load Balancing¶
The CFS scheduler supports load balancing across multiple processors, but it uses a sophisticated approach.
- Defining "Load": In CFS, the load of a thread is not just the number of threads in a queue. It is a calculated metric that combines:
- The thread's priority.
- Its average rate of CPU utilization.
- Implication of this Definition: A high-priority thread that is mostly I/O-bound (and thus uses little CPU) will have a low load, similar to a low-priority thread. This gives a more accurate picture of CPU demand than simply counting threads.
- Queue Load: The total load of a CPU's run queue is the sum of the loads of all threads in that queue. The goal of load balancing is to make the load on all queues approximately equal.
The Challenge: The Cost of Migration¶
As discussed in Section 5.5.4, blindly moving threads between processors to balance load can be counterproductive because it can cause:
- Cache Misses: The new CPU's cache is "cold" for the migrated thread.
- NUMA Penalties: On NUMA systems, a thread may be moved to a CPU that is farther from its memory, leading to slower memory access times.
The Solution: Scheduling Domains¶
To address this, Linux CFS uses a hierarchical model of scheduling domains. A scheduling domain is a set of CPU cores that can be balanced against each other. The cores are grouped based on how they share hardware resources.
- Go to Figure 5.27: This figure illustrates the hierarchical scheduling domains.
- At the lowest level, each core has its own L1 cache.
- Domain0 and Domain1: Cores that share an L2 cache are grouped into a domain (e.g., core0 & core1 in domain0; core2 & core3 in domain1). Migrating a thread within this domain is relatively cheap because the cores share the L2 cache.
- Processor-Level Domain (NUMA Node): The next level up combines domains that share an L3 cache. This is often a NUMA node. Migrating a thread within this node is more expensive than within an L2 domain, but cheaper than moving between NUMA nodes.
- System-Level Domain: On a larger NUMA system, an even higher level would combine separate processor-level NUMA nodes. Migration at this level is the most expensive.
The Load Balancing Strategy¶
CFS uses a strategic, hierarchical approach to load balancing:
Balance at the Lowest Level First: The scheduler first attempts to balance load within the smallest, most tightly-coupled domain (e.g., within
domain0itself). This minimizes migration cost by keeping threads on cores that share a cache.Progress Up the Hierarchy: If a significant imbalance persists, balancing then occurs at the next level (e.g., between
domain0anddomain1).Reluctance for NUMA Migration: CFS is very reluctant to migrate threads between separate NUMA nodes. It will only do this under severe load imbalances because the performance penalty of remote memory access is so high.
General Rule: If the overall system is busy, CFS will typically not load-balance beyond the domain local to each core to avoid the memory latency penalties of NUMA systems. It prefers a slight load imbalance over a significant increase in memory access time.
This NUMA-aware strategy allows CFS to balance load effectively while intelligently managing the trade-off between CPU utilization and memory access latency.
5.7.2 Example: Windows Scheduling¶
The Core Algorithm¶
The Windows operating system uses a priority-based, preemptive scheduling algorithm. The part of the Windows kernel responsible for this is called the dispatcher.
- Fundamental Rule: The dispatcher ensures that the highest-priority runnable thread will always be the one that is running.
- A thread that is selected to run will continue until one of four things happens:
- It is preempted by a higher-priority thread.
- It terminates.
- Its time quantum (time slice) expires.
- It performs a blocking system call (e.g., for I/O).
This design guarantees that high-priority threads (especially real-time ones) get immediate access to the CPU when they need it.
The 32-Level Priority Scheme¶
The dispatcher manages threads using a 32-level priority scheme (from 0 to 31). These priorities are divided into two main classes:
- Variable Class (Priorities 1-15): This class contains most user and application threads. Their priorities can be dynamically adjusted by the OS.
- Real-Time Class (Priorities 16-31): This class contains high-priority threads. Their priorities are static and are not adjusted by the OS.
- Priority 0: Reserved for a special thread used for memory management operations.
How the dispatcher selects a thread:
- It maintains a separate queue for each of the 32 priority levels.
- It checks these queues in order from highest priority (31) to lowest (0).
- It runs the first ready thread it finds.
- If no threads are ready, it runs a special idle thread.
Process Priority Classes and Thread Relative Priorities¶
There is a mapping between the numeric kernel priorities (1-31) and what a programmer specifies using the Windows API.
A. Process Priority Classes A process belongs to one of six priority classes, which sets the general importance of all its threads:
IDLE_PRIORITY_CLASSBELOW_NORMAL_PRIORITY_CLASSNORMAL_PRIORITY_CLASS(This is the typical default)ABOVE_NORMAL_PRIORITY_CLASSHIGH_PRIORITY_CLASSREALTIME_PRIORITY_CLASS
A process's priority class can be set at creation or changed using the SetPriorityClass() API function.
B. Thread Relative Priorities Within a given priority class, an individual thread can have a relative priority that modifies its base importance. The relative priorities are:
IDLELOWESTBELOW_NORMALNORMALABOVE_NORMALHIGHESTTIME_CRITICAL
Mapping to a Numeric Priority¶
The final numeric priority (from 1 to 31) of a thread is determined by combining its Process Priority Class and its Thread Relative Priority.
- Go to Figure 5.28: This table shows the complete mapping. To find a thread's numeric priority:
- Find its Priority Class in the top row.
- Find its Relative Priority in the left column.
- The cell where the row and column intersect is the thread's numeric priority.
- Example: A thread in the
ABOVE_NORMAL_PRIORITY_CLASSwith aNORMALrelative priority has a numeric priority of 10.
Base Priority¶
Each thread has a base priority, which is the default priority value within its class.
- By default, the base priority is the value corresponding to the
NORMALrelative priority for that class. - The base priorities for the default
NORMALrelative priority are:REALTIME_PRIORITY_CLASS— 24HIGH_PRIORITY_CLASS— 13ABOVE_NORMAL_PRIORITY_CLASS— 10NORMAL_PRIORITY_CLASS— 8BELOW_NORMAL_PRIORITY_CLASS— 6IDLE_PRIORITY_CLASS— 4
A thread typically starts with the base priority of its process, but this can be changed using the SetThreadPriority() API function.
Dynamic Priority Adjustment for Variable-Class Threads¶
For threads in the variable-priority class (1-15), Windows dynamically adjusts their priority to optimize system behavior and user experience. These adjustments are made based on how the thread uses the CPU.
1. Priority Lowering (For CPU-bound Threads)
- When it happens: When a thread uses up its entire time quantum (time slice) and is interrupted.
- The action: The thread's priority is lowered.
- The Limit: The priority is never lowered below the thread's base priority.
- The Goal: This penalizes compute-bound threads that try to monopolize the CPU. By lowering their priority, they yield the CPU to other, potentially more interactive, threads.
2. Priority Boosting (For I/O-bound and Interactive Threads)
- When it happens: When a thread is released from a wait operation (e.g., after finishing an I/O request like a keypress or disk read).
- The action: The dispatcher boosts (increases) the thread's priority.
- The Amount: The boost is not uniform. It depends on what the thread was waiting for.
- A thread waiting for a keyboard or mouse event gets a large priority increase. This is because such events are directly tied to user interactivity.
- A thread waiting for a disk operation gets a moderate increase.
- The Goal:
- To give excellent response times to interactive threads. When a user clicks, the waiting thread gets a high priority and runs immediately.
- To allow I/O-bound threads to quickly issue their next I/O request, which helps keep I/O devices busy.
- This strategy effectively allows I/O-bound and interactive threads to maintain high priority, while CPU-bound threads sink to lower priorities, using only the spare CPU cycles.
This general strategy of lowering priority on time-quantum expiration and boosting it after I/O waits is common and is also used by other operating systems, including UNIX variants.
Special Foreground Process Boost¶
Windows provides an additional performance optimization for the process the user is directly interacting with.
- The Scenario: A user is running an interactive program (like a word processor or a game) in a window.
- The Distinction: Windows identifies the foreground process (the process in the currently selected window) separately from background processes.
- The Rule: When a process in the
NORMAL_PRIORITY_CLASSmoves into the foreground, Windows does not change its priority, but it increases its scheduling quantum. - The Quantum Increase: The factor is typically 3. This means the foreground process gets three times longer to run before a time-sharing preemption occurs.
- The Goal: This longer time quantum reduces the overhead of context switches for the active application, giving it a larger, more contiguous chunk of CPU time. This results in smoother performance and better responsiveness for the application the user is actively using.
Advanced Windows Scheduling Features¶
User-Mode Scheduling (UMS)¶
Introduced in Windows 7, User-Mode Scheduling (UMS) is a powerful feature that allows applications to create and manage their own threads entirely in user mode, without requiring intervention from the Windows kernel scheduler for every context switch.
- Benefit: For applications that create a massive number of threads, this is vastly more efficient. The overhead of a system call for every scheduling decision is eliminated.
- Comparison to Fibers: Earlier versions of Windows had a similar concept called fibers, where multiple user-mode threads were mapped to a single kernel thread. However, fibers had a major limitation: they all had to share a single Thread Environment Block (TEB), which is a kernel data structure. If one fiber made a Windows API call that modified the TEB, it would corrupt the state for all other fibers on that kernel thread.
- UMS Advantage: UMS overcomes this by providing each user-mode thread with its own dedicated thread context, including a separate TEB, allowing them to safely make Windows API calls.
- Intended Use: UMS is not meant to be used directly by most programmers. Writing a correct and efficient user-mode scheduler is very complex. Instead, UMS is a low-level mechanism used by programming frameworks and libraries. A prime example is Microsoft's Concurrency Runtime (ConcRT), a C++ framework for task-based parallelism. ConcRT provides a user-mode scheduler that breaks programs into tasks and schedules them efficiently across CPU cores.
Multiprocessor and SMT Scheduling¶
Windows fully supports scheduling on multiprocessor systems, incorporating the concepts from Section 5.5, such as processor affinity.
1. Logical Processors and SMT Sets
- On systems with Simultaneous Multithreading (SMT), like Intel's Hyper-Threading, each physical core appears as multiple logical processors to the OS.
- Windows groups these logical processors that belong to the same physical core into SMT sets.
- Example: A quad-core system with 2-way SMT (Hyper-Threading) has 8 logical processors. These are grouped into 4 SMT sets:
{0, 1},{2, 3},{4, 5},{6, 7}. - Scheduling Goal: To avoid the cache penalties discussed in Section 5.5.4, the Windows scheduler tries very hard to keep a thread running on logical processors within the same SMT set. Moving a thread between two logical processors on the same core is cheap because they share the caches.
2. Ideal Processor and Load Distribution To distribute work evenly across all processors, Windows uses the concept of an ideal processor for each thread.
- Ideal Processor: This is a number that identifies the thread's preferred logical processor.
- Assignment Strategy:
- Each process is given a starting seed value for its first thread's ideal processor.
- For each new thread created by that process, the ideal processor is calculated by incrementing the seed.
- On an SMT system, the increment is designed to jump to the next SMT set, not just the next logical processor. This ensures load is spread across physical cores first.
- Example: In our 8-logical-processor system, a process with a seed of 0 would assign ideal processors as: 0, 2, 4, 6, 0, 2...
- Avoiding Contention: To prevent every process from assigning its first thread to processor 0, different processes are given different initial seed values.
- Example: A second process might have a seed of 1, causing it to assign ideal processors as: 1, 3, 5, 7, 1, 3...
This sophisticated strategy ensures that thread load is automatically and evenly distributed across all physical cores in the system, minimizing hotspots and maximizing efficient use of the hardware.
5.7.3 Example: Solaris Scheduling¶
Overview: Scheduling Classes¶
Solaris employs a highly configurable, priority-based thread scheduling system. Its key feature is the use of scheduling classes. Each thread belongs to one of six distinct classes, which define its scheduling behavior:
- Time Sharing (TS) - The default class for most processes.
- Interactive (IA) - Similar to TS, but gives a priority boost to windowing applications.
- Real Time (RT) - For threads with the strictest timing requirements.
- System (SYS) - Reserved for kernel threads.
- Fair Share (FSS) - Uses CPU shares instead of priorities.
- Fixed Priority (FP) - Priorities are static and not dynamically adjusted.
Each class has its own range of priorities and can use a different scheduling algorithm.
The Time-Sharing (TS) and Interactive (IA) Classes¶
These are the default classes for user applications. They use a multilevel feedback queue algorithm that dynamically alters thread priorities to achieve a good balance between interactive response and CPU-bound throughput.
- Core Policy: There is an inverse relationship between priority and time quantum.
- High-priority threads (like interactive ones) get a small time slice. This allows them to get on the CPU quickly, do their work, and return to waiting for user input.
- Low-priority threads (like CPU-bound ones) get a large time slice. This allows them to run for longer periods once they get the CPU, improving throughput and reducing context-switch overhead.
- Go to Figure 5.29: This is the dispatch table for the TS and IA classes. It has four columns:
- Priority: The class-specific priority (0-59). A higher number means a higher priority.
- Time Quantum: The time slice assigned to a thread at this priority. Notice priority 0 has a 200 ms quantum, while priority 59 has a 20 ms quantum. This is the inverse relationship in practice.
- Time Quantum Expired: The new priority for a thread that used its entire time slice without blocking. This thread is considered CPU-intensive, so its priority is lowered (e.g., from 50 to 40) to penalize it and benefit interactive tasks.
- Return from Sleep: The new priority for a thread that is waking up from a sleep (e.g., after an I/O operation). This thread is considered interactive, so its priority is boosted to a high value (between 50 and 59) so it can run immediately, providing good response time.
The Interactive (IA) class uses the exact same dispatch table and policy as the Time-Sharing class. The difference is that it is intended for threads associated with a graphical window, and the system may assign them to a higher initial priority within this framework to ensure a responsive user interface.
The Other Scheduling Classes¶
Real Time (RT) Class
- Threads in this class have the highest possible priority in the system (aside from interrupt threads).
- They will always run before any thread from the TS, IA, SYS, FSS, or FP classes.
- This guarantees a response within a bounded time, which is essential for hard real-time tasks. It is used sparingly.
System (SYS) Class
- This class is used for kernel threads (like the scheduler itself or memory management daemons).
- The priorities of SYS threads are static; they are set when the thread is created and do not change.
- This class is reserved for the kernel; user processes running system calls are not in the SYS class.
Fixed Priority (FP) and Fair Share (FSS) Classes
- Introduced in Solaris 9.
- Fixed Priority (FP): Threads in this class have the same priority range as TS/IA, but their priorities are not dynamically adjusted by the scheduler. They remain fixed.
- Fair Share (FSS): This class does not use priorities. Instead, it allocates CPU time based on CPU shares assigned to a group of processes called a project. This ensures that each project gets its entitled fraction of the CPU resources.
Global Priorities and Final Scheduling Decision¶
Although each class has its own priorities, the Solaris scheduler combines them into a single, global priority scheme to make its final decision.
- Go to Figure 5.30: This figure shows how all class-specific priorities map onto a global scale.
- The scheduler always picks the runnable thread with the highest global priority to run.
- The thread runs until it (1) blocks, (2) uses its time slice, or (3) is preempted by a higher-priority thread.
- If multiple threads have the same global priority, they are scheduled in a round-robin fashion from a circular queue.
- Interrupt Threads: The kernel has special threads for handling interrupts. These are not in any scheduling class and execute at the very highest global priorities (160-169), ensuring hardware interrupts are serviced immediately.
Note on Threading Model: While Solaris traditionally used the many-to-many model, it switched to a one-to-one model starting with Solaris 9, meaning each user thread is mapped directly to a single kernel thread.
5.8 Algorithm Evaluation¶
The Problem: Choosing a Scheduling Algorithm¶
We have learned about many different CPU-scheduling algorithms (FCFS, SJF, Priority, Round Robin, etc.), each with its own parameters and strengths. A fundamental question is: How do we choose the best one for a particular computer system? This is a complex problem because there is no single "best" algorithm that fits all scenarios.
Step 1: Define the Evaluation Criteria¶
The first and most critical step is to define what "best" means for our specific system. We must establish clear, measurable goals based on the performance criteria from Section 5.2 (CPU utilization, throughput, turnaround time, waiting time, response time).
Our criteria will be a set of requirements that the chosen algorithm must meet. These criteria often involve trade-offs. For example, we might define our goals as:
- Criterion 1: Maximize CPU utilization, but with a strict constraint that the maximum response time for any interactive user must not exceed 300 milliseconds.
- Criterion 2: Maximize overall throughput (number of processes completed per hour), with the additional goal that the average turnaround time is linearly proportional to the total execution time of the process. This means that a job that takes twice as long to run should, on average, take twice as long to complete from submission to finish, ensuring predictability.
By defining criteria like this, we move from a vague goal like "make it fast" to a specific, testable set of requirements that we can use to compare different algorithms.
Step 2: Evaluate the Algorithms¶
Once we have our criteria, the next step is to evaluate the available scheduling algorithms against them. The following sections will describe the various methods we can use to perform this evaluation, such as deterministic modeling, simulation, and implementation.
5.8.1 Deterministic Modeling¶
What is Deterministic Modeling?¶
Deterministic Modeling is a type of analytic evaluation. It is a method where we take a specific, predetermined workload (we know exactly when each process arrives and how long its CPU burst will be) and use this to compute the exact performance of different scheduling algorithms for that one scenario.
It answers the question: "For this exact set of processes, which algorithm performs best according to a given metric (like average waiting time)?"
A Worked Example¶
Consider the following workload. All five processes arrive at time 0 in the order given:
| Process | Burst Time (ms) |
|---|---|
| P1 | 10 |
| P2 | 29 |
| P3 | 3 |
| P4 | 7 |
| P5 | 12 |
We will compare the FCFS, SJF (non-preemptive), and RR (quantum = 10 ms) algorithms based on their average waiting time.
1. First-Come, First-Served (FCFS) The schedule is simply the order of arrival: P1, P2, P3, P4, P5.
- Timeline:
P1 [0-10]->P2 [10-39]->P3 [39-42]->P4 [42-49]->P5 [49-61] - Waiting Time Calculation:
- P1: 0 ms
- P2: 10 ms (waits for P1)
- P3: 39 ms (waits for P1 + P2)
- P4: 42 ms (waits for P1 + P2 + P3)
- P5: 49 ms (waits for P1 + P2 + P3 + P4)
- Average Waiting Time: (0 + 10 + 39 + 42 + 49) / 5 = 140 / 5 = 28 ms
2. Shortest-Job-First (SJF) - Non-preemptive The schedule is in order of increasing burst time: P3, P4, P1, P5, P2.
- Timeline:
P3 [0-3]->P4 [3-10]->P1 [10-20]->P5 [20-32]->P2 [32-61] - Waiting Time Calculation:
- P1: 10 ms (waits for P3 + P4)
- P2: 32 ms (waits for P3 + P4 + P1 + P5)
- P3: 0 ms
- P4: 3 ms (waits for P3)
- P5: 20 ms (waits for P3 + P4 + P1)
- Average Waiting Time: (10 + 32 + 0 + 3 + 20) / 5 = 65 / 5 = 13 ms
3. Round Robin (RR) - Quantum = 10 ms Processes are executed in cycles. The ready queue order at start is P1, P2, P3, P4, P5.
- Timeline:
- Time 0-10: P1 runs for its full 10 ms burst and finishes.
- Time 10-20: P2 runs for 10 ms (19 ms remaining).
- Time 20-23: P3 runs for its full 3 ms burst and finishes.
- Time 23-30: P4 runs for its full 7 ms burst and finishes.
- Time 30-40: P5 runs for 10 ms (2 ms remaining).
- Time 40-50: P2 runs for another 10 ms (9 ms remaining).
- Time 50-52: P5 runs for its remaining 2 ms and finishes.
- Time 52-61: P2 runs for its final 9 ms and finishes.
- Waiting Time Calculation:
- P1: 0 ms
- P2: (10 [wait for P1] + (40-20=20) [wait after 1st run] + (52-50=2) [wait after 2nd run]) = 32 ms
- P3: 20 ms (waits for P1 and P2's first time slice)
- P4: 23 ms (waits for P1, P2's first slice, and P3)
- P5: (30 [wait for P1, P2, P3, P4] + (50-40=10) [wait after 1st run]) = 40 ms
- Average Waiting Time: (0 + 32 + 20 + 23 + 40) / 5 = 115 / 5 = 23 ms
Conclusion for this workload: SJF gives the best average waiting time (13 ms), followed by RR (23 ms), and then FCFS (28 ms).
Advantages and Disadvantages of Deterministic Modeling¶
Advantages:
- It is simple and fast.
- It provides exact numbers for a direct comparison.
- It is excellent for illustrating algorithms and proving properties (e.g., it can be mathematically proven that SJF is optimal for minimizing average waiting time when all jobs are available at time zero).
Disadvantages:
- It requires exact, predetermined input data, which is often not available in real systems where burst times are unpredictable.
- The results apply only to the specific workload used. A different workload might yield completely different results.
Primary Use: Deterministic modeling is best used for teaching concepts, providing simple examples, and analyzing closed, repetitive systems where the workload is perfectly known and constant.
5.8.2 Queueing Models¶
Moving Beyond a Single Snapshot¶
In real systems, the set of processes and their burst times are not fixed; they change constantly. Deterministic modeling is too rigid for this environment. Instead, we can use Queueing Models, which analyze system performance based on the probability distributions of key events rather than fixed values.
- The Distributions:
- CPU/I/O Burst Distribution: The frequency of different CPU burst lengths. This is often an exponential distribution, which can be described simply by its mean (average) value.
- Arrival-Time Distribution: The frequency of new processes arriving in the system.
By knowing these distributions, we can calculate average performance metrics like throughput, CPU utilization, and waiting time for various algorithms.
Modeling the System as a Network of Queues¶
In queueing theory, we model a computer system as a network of servers, each with its own queue.
- The CPU is a server, and the ready queue is its waiting line.
- Each I/O device is a server, and its device queue is the waiting line.
Knowing the arrival rate (how often processes enter a queue) and the service rate (how quickly the server can process them), we can compute important results for any queue, such as:
- Average queue length
- Average waiting time in the queue
- Server utilization
This overall study is known as queueing-network analysis.
Little's Formula¶
A fundamental and powerful result in queueing theory is Little's Formula. It states that for a stable system, the long-term average number of items in a queueing system is equal to the long-term average arrival rate multiplied by the average time an item spends in the system.
For a queue (excluding the item being serviced), it is expressed as: n = λ × W
Where:
- n = average queue length (number of items waiting, not including the one being served).
- λ (lambda) = average arrival rate (e.g., processes per second).
- W = average waiting time in the queue.
Why it's powerful: Little's formula is universal. It holds for any scheduling algorithm and any arrival distribution.
Example: If we know that an average of 7 processes per second (λ) arrive at the ready queue and that there are normally 14 processes (n) in the queue (waiting, not running), we can compute the average waiting time:
- n = λ × W
- 14 = 7 × W
- W = 14 / 7 = 2 seconds
So, the average process waits in the ready queue for 2 seconds.
Limitations of Queueing Analysis¶
While powerful, queueing analysis has significant limitations:
- Mathematical Complexity: The math becomes very difficult for complicated scheduling algorithms and realistic, complex distributions.
- Oversimplified Assumptions: To make the math tractable, models often use mathematically convenient distributions (like exponential) that may not perfectly reflect real-world behavior.
- Independence Assumptions: The models often require assumptions that may not be entirely accurate, such as assuming that arrivals are independent of the current system state.
- Approximate Results: Because of the simplifications and assumptions, the results from a queueing model are often approximations of real-system behavior, and their accuracy can sometimes be questionable.
Despite these limitations, queueing models provide invaluable insights for understanding the general relationships between arrival rate, queue length, and waiting time, and for making high-level comparisons between scheduling strategies.
5.8.3 Simulations¶
A More Accurate Evaluation Method¶
To get a more precise and realistic evaluation of scheduling algorithms than analytical models can provide, we use simulations. A simulation involves creating a software model of a computer system.
- How it works: The programmer creates data structures that represent the key components of the system, such as the CPU, I/O devices, the ready queue, and the processes. The simulation is controlled by a variable that acts as a clock.
- Process: As the clock advances, the simulator updates the system state to reflect everything that happens: processes arriving, the scheduler making decisions, processes running on the CPU, and processes performing I/O.
- Output: Throughout the simulation, the program collects detailed statistics on performance metrics (like average waiting time, throughput, etc.) for the algorithm being tested.
Generating Input for the Simulation (The "Driver")¶
The biggest question is: what data should be used to test the system? There are two primary methods, each with trade-offs.
1. Distribution-Driven Simulation (Synthetic Input)
- How it works: A random-number generator is used to create artificial events (process arrivals, CPU burst lengths, etc.). These events are generated according to a specified probability distribution (e.g., exponential, uniform).
- Types of Distributions:
- Mathematical: Using a standard statistical distribution.
- Empirical: Using a distribution measured from a real system.
- Disadvantage: This method may be inaccurate because it only captures the frequency of events, not the order or relationships between them. Real systems often have sequences of events that are correlated, which a random distribution might not replicate.
2. Trace Tape Simulation (Real Input)
- How it works: This method uses a recorded log, or trace file, of the actual events that occurred on a real system.
- Go to Figure 5.31: This figure illustrates the process perfectly.
- A real system is monitored, and a detailed log (the "trace tape") is created, recording the exact sequence of process executions, CPU bursts, and I/O operations.
- This trace file is then used as the input to drive the simulator for different algorithms (FCFS, SJF, RR).
- Major Advantage: This is the best method for comparing algorithms because it tests them all on the exact same set of real inputs. This allows for a direct, fair, and highly accurate comparison of their performance on a realistic workload.
Disadvantages of Simulations¶
Despite being highly useful, simulations have significant costs:
- Computational Expense: Simulations can require hours of computer time, especially detailed ones.
- Detail vs. Cost Trade-off: A more detailed and accurate simulation model takes significantly more time to run.
- Storage Space: Trace files that record extensive system activity can consume large amounts of storage space.
- Development Effort: The design, coding, and debugging of the simulator itself is a complex and time-consuming software engineering task.
5.8.4 Implementation¶
The Most Accurate and Costly Method¶
The only way to know exactly how a scheduling algorithm will perform for a specific environment is to code it, install it in the real operating system kernel, and let users run their real applications on it.
- This is not a model or an approximation; it is the actual system under test.
- The major advantage is that you get to see the scheduler's performance under a completely realistic workload, including all the complex, unpredictable interactions of real user behavior and applications.
The Major Drawback: High Cost and Risk¶
While implementation provides the most valid results, it is also the most difficult and dangerous approach for several reasons:
Expense and Disruption:
- Implementing a new scheduler requires modifying the operating system kernel, which is a complex and sensitive task.
- Once implemented, the scheduler affects every process and user on the system.
- Putting an experimental, and potentially poorly performing, algorithm into a production system can lead to severe performance degradation and user dissatisfaction.
The Environment is Never Constant:
- The "best" algorithm under one workload might be the worst under another.
- Users' needs and the mix of applications change over time. An algorithm tuned for today's workload might be a poor fit for tomorrow's.
- This makes it difficult to draw permanent, general conclusions from a single implementation.
Because of these high costs and risks, new scheduling algorithms are typically thoroughly evaluated using simulations and analytic models before they are ever considered for a real implementation in a production operating system.
5.9 Summary¶
Core Concepts¶
- CPU Scheduling is the process of selecting a waiting process from the ready queue and allocating the CPU to it. The dispatcher is the module that handles the actual context switch.
- Algorithms are either preemptive (the OS can force a process off the CPU) or nonpreemptive (the process keeps the CPU until it releases it). Modern OSes are predominantly preemptive.
- The five key evaluation criteria are: CPU Utilization, Throughput, Turnaround Time, Waiting Time, and Response Time.
Fundamental Scheduling Algorithms¶
- First-Come, First-Served (FCFS): Simple, but can lead to the convoy effect, where short processes wait for one long process.
- Shortest-Job-First (SJF): Provably optimal for minimizing average waiting time. The challenge is accurately predicting the length of the next CPU burst. It can be nonpreemptive or preemptive (Shortest-Remaining-Time-First).
- Round-Robin (RR): Designed for time-sharing. Each process runs for a time quantum. It is preemptive and provides good response time. Performance depends heavily on the size of the quantum.
- Priority Scheduling: Schedules processes based on priority. Can lead to indefinite blocking (starvation) of low-priority processes, which is often solved by aging (gradually increasing a process's priority over time).
- Multilevel Queue: Partitions the ready queue into several separate queues (e.g., system, interactive, batch), each with its own scheduling algorithm and priority level.
- Multilevel Feedback Queue: Similar to multilevel queues, but allows processes to migrate between queues based on their observed behavior (e.g., a CPU-bound process may be moved to a lower-priority queue).
Multiprocessor and Real-Time Scheduling¶
- Multicore Systems: Present logical CPUs to the OS. Scheduling must consider:
- Load Balancing: Distributing work evenly across cores. This can be done via push migration (a master thread moves tasks) or pull migration (an idle core pulls a task).
- Processor Affinity: Trying to keep a thread on the same processor to benefit from a warm cache. This can be soft (best effort) or hard (mandatory).
- NUMA: On Non-Uniform Memory Access systems, scheduling a thread on a CPU close to its memory is critical for performance.
- Heterogeneous Multiprocessing (HMP): Uses cores with different performance and power characteristics (e.g., ARM's big.LITTLE) to optimize for power efficiency.
- Real-Time Scheduling:
- Soft Real-Time: No guarantees, but gives preference to real-time tasks.
- Hard Real-Time: Requires absolute deadline guarantees.
- Rate-Monotonic Scheduling (RMS): A static-priority algorithm for periodic tasks where a shorter period means a higher priority. It is optimal among static-priority schedulers but has a CPU utilization bound of N(2^(1/N) - 1).
- Earliest-Deadline-First (EDF): A dynamic-priority algorithm where the earliest deadline has the highest priority. It is theoretically optimal and can achieve 100% CPU utilization.
- Proportional Share: Allocates T shares among applications; a task with N shares is guaranteed N/T of the CPU time. Requires admission control.
Operating System Examples¶
- Linux (Completely Fair Scheduler - CFS): Aims for "fair" CPU distribution. It uses virtual run time (vruntime) to track how much CPU a task has effectively used, always scheduling the task with the smallest vruntime. It uses a red-black tree for efficiency and is NUMA-aware.
- Windows: Uses a preemptive, priority-based algorithm with 32 priority levels (1-15 variable class, 16-31 real-time class). It dynamically adjusts priorities: lowers them for CPU-bound threads and boosts them for I/O-bound and interactive threads. The foreground process gets a longer time quantum.
- Solaris: Uses six scheduling classes (Time Sharing, Interactive, Real Time, System, Fair Share, Fixed Priority). The Time-Sharing class uses a multilevel feedback queue with an inverse relationship between priority and time quantum. All class-specific priorities are mapped to a single global priority for the final scheduling decision.
Algorithm Evaluation Methods¶
- Deterministic Modeling: Uses a predetermined, specific workload for exact analysis. Simple and good for examples, but not generalizable.
- Queueing Models: Uses probability distributions for arrivals and burst times to analyze system performance mathematically. Little's Formula (n = λ × W) is a key, universal result.
- Simulations: Involves programming a model of the system. Can be driven by random distributions or, more accurately, by a trace tape of a real system's events. More accurate but computationally expensive.
- Implementation: Coding the algorithm into a real OS kernel. Provides the most valid results but is the most costly, disruptive, and risky approach.
Chapter 6: Synchronization Tools¶
Introduction to Cooperating Processes¶
A cooperating process is defined as a process that can influence, or be influenced by, other processes running within the same system. These processes work together to accomplish a common task.
There are two primary ways cooperating processes can share information:
- Directly sharing a logical address space: Both the code and data reside in the same shared memory space (like threads within a process).
- Sharing data through inter-process communication (IPC): Processes can use shared memory (a region of memory designated for common access) or message passing (sending explicit messages to each other).
The Core Problem: When processes cooperate by sharing data—especially through a shared logical address space or shared memory—allowing them concurrent (overlapping in time) access to that data can lead to data inconsistency. The final state of the shared data might become incorrect or unpredictable.
Chapter Goal: This chapter focuses on mechanisms to coordinate, or synchronize, the actions of these cooperating processes. The goal is to ensure orderly execution when they share a logical address space, thereby maintaining data consistency.
Chapter Objectives¶
By the end of this chapter, you should be able to:
- Describe the critical-section problem and illustrate a race condition.
- Explain hardware-based solutions to the critical-section problem, including:
- Memory barriers
- Compare-and-swap operations
- Atomic variables
- Demonstrate how the following software tools can solve the critical-section problem:
- Mutex locks
- Semaphores
- Monitors
- Condition variables
- Evaluate which synchronization tool is best for scenarios with low, moderate, and high levels of contention (competition for the shared resource).
6.1 Background¶
Let's review two key concepts from earlier chapters that set the stage for the synchronization problem:
Concurrent Execution (Go to Section 3.2.2): The CPU scheduler rapidly switches the CPU between processes. This interleaving means a process can be paused at any point in its instruction stream—even before finishing a critical operation—so that another process can run. From a macroscopic view, they appear to run simultaneously, but their instruction sequences are interleaved in time on a single core.
Parallel Execution (Go to Section 4.2): In a multi-core system, two or more processes can execute their instructions literally at the same time, with each stream running on a separate processing core.
The Crucial Link to Data Integrity: Both concurrent execution (interleaving on a single core) and parallel execution (simultaneous execution on multiple cores) create situations where multiple processes might read and write the same shared data at overlapping times. This overlapping access is the root cause of potential data corruption, which we will explore next.
Revisiting the Bounded-Buffer Problem¶
Recall from Chapter 3 (Go to Section 3.5) the producer-consumer problem, a classic example of cooperating processes. They share a bounded buffer in memory. Our earlier solution had a limitation: it could only hold at most BUFFER_SIZE - 1 items at a time.
Attempting a Fix with a Counter:
To allow the buffer to hold a full BUFFER_SIZE items, we introduce an integer variable count, initialized to 0.
- count is incremented when the producer adds a new item.
- count is decremented when the consumer removes an item.
Here is the modified code:
Producer Process:
while (true) {
/* produce an item in next_produced */
while (count == BUFFER_SIZE)
; /* do nothing - wait if buffer is full */
buffer[in] = next_produced;
in = (in + 1) % BUFFER_SIZE;
count++;
}
Consumer Process:
while (true) {
while (count == 0)
; /* do nothing - wait if buffer is empty */
next_consumed = buffer[out];
out = (out + 1) % BUFFER_SIZE;
count--;
/* consume the item in next_consumed */
}
The Problem: Concurrent Execution While each routine is logically correct on its own, they may fail when executed concurrently (interleaved or in parallel).
Illustration of the Failure:
Assume count has the correct value of 5. If the producer executes count++ and the consumer executes count-- concurrently, the final value of count could be 4, 5, or 6. Only 5 is correct. This is a serious error.
Why Does This Happen? The Machine-Level View¶
The high-level statements count++ and count-- are not atomic (indivisible) at the machine level. They are each implemented as multiple instructions:
Implementation of count++:
- register1 = count (Load)
- register1 = register1 + 1 (Increment)
- count = register1 (Store)
Implementation of count--:
- register2 = count (Load)
- register2 = register2 - 1 (Decrement)
- count = register2 (Store)
(Note: register1 and register2 could be the same physical CPU register, but its contents are saved/restored during a context switch, as explained in Section 1.2.3).
When the producer and consumer run concurrently, these low-level instructions can be interleaved in any order by the OS scheduler. The order within each high-level statement is preserved, but the interleaving between processes is arbitrary.
Example of a Problematic Interleaving (Race Condition):
| Time | Process | Instruction Executed | Value (after instruction) |
|---|---|---|---|
| T0 | Producer | register1 = count | register1 = 5 |
| T1 | Producer | register1 = register1 + 1 | register1 = 6 |
| T2 | Consumer | register2 = count | register2 = 5 (stale read!) |
| T3 | Consumer | register2 = register2 - 1 | register2 = 4 |
| T4 | Producer | count = register1 | count = 6 |
| T5 | Consumer | count = register2 | count = 4 (final incorrect state) |
This interleaving results in count = 4, even though one item was produced and one consumed (so it should remain 5). The problem occurred because the consumer (T2) read the old value of count (5) before the producer had stored its updated value (6). The consumer's later store (T5) overwrites the producer's update.
Definition: Race Condition¶
The situation described above is called a race condition. A race condition occurs when:
- Two or more processes (or threads) concurrently access and manipulate the same shared data.
- The final outcome of the execution (the value of the data) depends on the precise, non-deterministic order in which the accesses and manipulations occur.
The Core Requirement: Synchronization
To prevent race conditions, we must synchronize the cooperating processes. We must ensure that only one process at a time can be allowed to manipulate the shared variable count (or, more generally, a shared resource or data structure).
The Importance of Synchronization in Modern Systems¶
This problem is fundamental and frequent in operating systems, where many components manipulate shared resources (e.g., CPU, memory, files). Furthermore, the rise of multicore systems has made multithreaded programming essential for performance. In these applications, multiple threads—which almost always share data—run in parallel on different cores, making the potential for race conditions and data corruption even greater.
Therefore, the study of process synchronization and coordination mechanisms is a central topic in operating systems and concurrent programming.
6.2 The Critical-Section Problem¶
Defining the Problem¶
We formalize the synchronization issue as the critical-section problem. Imagine a system with n processes: {P0, P1, ..., Pn−1}.
- Critical Section: Each process has a segment of code where it accesses and updates shared data (like the count variable in our producer-consumer example).
- Fundamental Rule: When one process is executing in its critical section, no other process is allowed to execute in its critical section. This prevents concurrent access to shared data.
The goal is to design a protocol (a set of rules) that processes can follow to coordinate entry into their critical sections, enabling safe data sharing.
Process Structure for the Protocol¶
To implement the protocol, we structure each process's code into four distinct parts. Go to Figure 6.1 for a visual representation of this general structure.
- Entry Section: The block of code a process executes to request permission to enter its critical section. This is where the synchronization protocol is implemented (e.g., checking a flag, acquiring a lock).
- Critical Section: The segment of code where the process accesses and manipulates shared variables or resources.
- Exit Section: The block of code a process executes to signal that it is leaving its critical section, allowing other processes to enter. This "cleans up" the protocol (e.g., releasing a lock).
- Remainder Section: All the rest of the process's code that does not involve shared data.
General Code Structure:
while (true) {
/* ENTRY SECTION */
/* Request permission to enter */
/* CRITICAL SECTION */
/* Access and update shared data */
/* EXIT SECTION */
/* Signal departure from critical section */
/* REMAINDER SECTION */
/* Code that does not use shared data */
}
Requirements for a Correct Solution¶
Any valid solution to the critical-section problem must satisfy three essential conditions:
1. Mutual Exclusion:
- Definition: If process Pi is actively executing in its critical section, then no other process can be executing in its critical section.
- Purpose: This is the core, non-negotiable requirement that directly prevents concurrent access and race conditions.
2. Progress:
- Definition: This requirement has two key parts:
  a. No Deadlock: If no process is currently in its critical section, and one or more processes want to enter, the decision about who enters next cannot be made by processes that are stuck in their remainder section (i.e., not trying to enter). Only processes actively in their entry section can participate in the selection.
  b. No Indefinite Postponement: This selection must happen within a finite time. The system cannot simply stall forever, refusing to let any process in.
- Purpose: Ensures that the synchronization mechanism is useful—processes that need to enter the critical section will eventually be able to do so. It prevents scenarios where all processes are stuck waiting outside their critical sections.
3. Bounded Waiting:
- Definition: There must be a fixed upper bound (a limit) on the number of times other processes are allowed to enter their critical sections after a specific process Pi has requested entry (by reaching its entry section) and before that request is granted to Pi.
- Purpose: Prevents starvation. It guarantees that no process will be blocked from entering its critical section indefinitely while others repeatedly enter and exit. Every process gets a turn.
Underlying System Assumptions¶
To reason about solutions, we make two important assumptions about the system:
- Non-Zero Speed: Each process is executing at some speed greater than zero. It will eventually execute its instructions.
- No Relative Speed Assumptions: We cannot assume anything about the relative execution speeds of the n processes. One process may be much faster than another, they may run at variable speeds, or a process could be suspended (put to sleep) for an arbitrary amount of time in its remainder section. A correct solution must work correctly under any possible interleaving of instructions.
In summary, the critical-section problem asks us to design entry and exit sections that, when added to processes, guarantee Mutual Exclusion, Progress, and Bounded Waiting under conditions of uncertain scheduling and timing.
Race Conditions in the Operating System Kernel¶
Many processes can execute in kernel mode (with privileged access to hardware and data) simultaneously. Therefore, the OS kernel itself is highly susceptible to race conditions.
Examples:
- Open File List: A kernel data structure that tracks all open files in the system. It must be updated every time a file is opened or closed. If two processes call open() simultaneously, their attempts to modify this shared list can interleave, corrupting it.
- Process ID Assignment (Go to Figure 6.2): This figure illustrates a concrete race condition.
  - Two processes, P0 and P1, concurrently call fork() to create child processes.
  - The kernel uses a shared global variable, next_available_pid, to assign the next unique Process Identifier (PID).
  - Without mutual exclusion, the following interleaving can occur:
    - Both P0 and P1 read the same value (e.g., 2615) from next_available_pid.
    - Both increment this value locally.
    - Both assign PID 2615 to their respective child processes.
    - The same PID is now assigned to two different processes, causing a serious error.
Other vulnerable kernel data structures include those for memory allocation, process lists, and interrupt handling. Eliminating such race conditions is a core responsibility of kernel developers.
Naïve Approach: Disabling Interrupts¶
In a single-core (uniprocessor) system, a crude solution exists:
- Idea: Prevent interrupts from occurring while a process is in its critical section.
- Effect: This guarantees the process will not be preempted (context-switched out) during its critical section. The sequence of instructions modifying the shared variable will run atomically from the scheduler's perspective.
Why This Fails in Multiprocessors:
- Performance Impact: Disabling interrupts on all CPUs in a multiprocessor system requires time-consuming inter-processor communication, slowing down every entry into a critical section.
- System Function Impairment: Many critical system functions (like the system clock) rely on timer interrupts. Disabling them would break these functions.
- Limited Scope: It only prevents preemption from the scheduler. It does nothing to prevent a second process running simultaneously on a different CPU core from accessing the same shared data.
Therefore, disabling interrupts is generally not a feasible general solution for synchronization, especially in modern multiprocessor systems.
Kernel Design Approaches: Preemptive vs. Nonpreemptive¶
Operating system kernels adopt one of two general design philosophies regarding synchronization:
1. Nonpreemptive Kernel:
- Definition: A kernel where a process running in kernel mode cannot be preempted. It retains the CPU until it:
- Exits kernel mode (returns to user mode),
- Blocks (e.g., waits for I/O), or
- Voluntarily yields the CPU.
- Synchronization Implication: Only one process is active in the kernel at any time. This inherently prevents race conditions on kernel data structures, as there is no concurrency within the kernel itself. Synchronization is simpler.
2. Preemptive Kernel:
- Definition: A kernel where a process running in kernel mode can be preempted (by a higher-priority process, often via a timer interrupt).
- Synchronization Implication: Multiple processes can be concurrently active in the kernel. This creates a risk of race conditions on shared kernel data. Designing such a kernel is much more difficult, especially for Symmetric Multiprocessing (SMP) systems where kernel code can literally run in parallel on multiple cores. It requires careful use of the synchronization tools discussed in this chapter.
Why Choose a Preemptive Kernel? Despite the complexity, preemptive kernels are favored because:
- Responsiveness: They prevent a poorly behaved kernel-mode process from monopolizing the CPU for arbitrarily long periods, leading to a more responsive system for interactive and real-time tasks.
- Real-Time Suitability: They allow a real-time process to immediately preempt a less critical process, even if it's in the kernel, which is essential for meeting strict timing deadlines.
6.3 Peterson's Solution¶
Introduction and Purpose¶
Peterson's solution is a classic, purely software-based algorithm for solving the critical-section problem for two processes. It is an important intellectual exercise, but it comes with a critical caveat for modern systems:
Important Disclaimer: Due to the complexities of modern computer architecture (e.g., out-of-order execution, memory caching behavior), Peterson's solution is not guaranteed to work correctly on real hardware without special hardware support (like memory barriers, discussed later). We study it anyway because it:
- Provides a clear algorithmic description of solving the critical-section problem.
- Illustrates the logical complexities involved in satisfying all three requirements (mutual exclusion, progress, bounded waiting) in software.
Problem Scope and Setup¶
Restrictions:
- Designed for exactly two processes, labeled P0 and P1.
- Uses the convention: For process Pi, the other process is Pj, where j = 1 - i.
Shared Data Structures (Global variables both processes can access):
- int turn: Indicates whose turn it is to enter the critical section. If turn == i, process Pi is allowed to enter (but may still have to wait).
- boolean flag[2]: An array of two boolean flags. flag[i] = true signals that process Pi is ready and wants to enter its critical section.
The Algorithm¶
Go to Figure 6.3 for the code structure. The algorithm for process Pi (where j = 1 - i) is:
while (true) {
// ENTRY SECTION
flag[i] = true; // Step 1: I want to enter.
turn = j; // Step 2: Defer to the other process.
// Busy-Wait Loop (part of entry section)
while (flag[j] && turn == j)
; // Do nothing and wait.
// CRITICAL SECTION
/* Access and update shared data */
// EXIT SECTION
flag[i] = false; // Step 3: I'm no longer interested.
// REMAINDER SECTION
/* Other code not using shared data */
}
How It Works: A Walkthrough¶
Let's break down the entry section logic:
1. flag[i] = true;
   - Process Pi declares its interest in entering the critical section. This is a public announcement.
2. turn = j;
   - Process Pi politely gives the turn to the other process (Pj). This act of setting turn to the other process is the key to ensuring bounded waiting and progress.
   - The Race for turn: If both processes execute this line concurrently, both write to the shared turn variable and one write overwrites the other. The final saved value of turn determines who "won" this polite race and gave away the turn last; the process that writes second determines the final value.
3. Busy-Wait Condition: while (flag[j] && turn == j)
   - This is the waiting loop. Process Pi only proceeds into its critical section once this condition is false. Its two parts:
     - flag[j] is true: the other process (Pj) has expressed interest in entering (by executing its own flag[j] = true).
     - turn == j: it is currently the other process's (Pj's) turn (according to the last value written to turn).
   - Interpretation: Pi will wait (spin) only if the other process is interested AND it is the other process's turn. If either condition is false, Pi proceeds:
     - If flag[j] is false, Pj isn't trying to enter, so Pi can go in.
     - If turn == i, then even if Pj is interested, it's Pi's turn, so Pi proceeds.
Key Insight: The combination of flag and turn variables cleverly resolves contention:
- If only one process wants to enter (the other's flag is false), it proceeds immediately.
- If both want to enter, the final value of turn (determined by the last process to execute turn = j) breaks the tie. The process that lost the race to be polite (i.e., the one whose turn = j write was overwritten) finds turn == i and enters. The other finds turn == j and waits.
Exit Section: flag[i] = false; signals that Pi is leaving and no longer needs the critical section, allowing the waiting process (if any) to proceed.
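As an illustrative aside (not from the text), the protocol can be exercised in C11. Declaring the shared variables `_Atomic` with the default sequentially consistent ordering makes the algorithm's atomicity assumptions actually hold, so two threads can safely guard a plain counter. Names such as `peterson_demo` are invented for this sketch:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static _Atomic int flag[2];   /* interest flags; 0 = false, 1 = true   */
static _Atomic int turn;      /* whose turn it is to wait              */
static long counter;          /* plain shared data the protocol guards */

static void *worker(void *arg) {
    int i = (int)(long)arg, j = 1 - i;
    for (int k = 0; k < 50000; k++) {
        /* ENTRY SECTION */
        atomic_store(&flag[i], 1);   /* announce interest            */
        atomic_store(&turn, j);      /* politely defer to the other  */
        while (atomic_load(&flag[j]) && atomic_load(&turn) == j)
            ;                        /* busy-wait                    */
        /* CRITICAL SECTION */
        counter++;                   /* only one thread is ever here */
        /* EXIT SECTION */
        atomic_store(&flag[i], 0);   /* no longer interested         */
    }
    return NULL;
}

long peterson_demo(void) {
    pthread_t t0, t1;
    counter = 0;
    pthread_create(&t0, NULL, worker, (void *)0L);
    pthread_create(&t1, NULL, worker, (void *)1L);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return counter;   /* 100000 when mutual exclusion holds */
}
```

If the two atomic stores in the entry section were allowed to reorder, updates to counter could be lost; sequential consistency is exactly what rules that out.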
In the next part, we will analyze how this algorithm satisfies the three required properties.
Proof of Correctness¶
We will now prove that Peterson's algorithm satisfies the three requirements for a correct solution to the critical-section problem.
1. Proof of Mutual Exclusion¶
We prove by contradiction. Assume the opposite: that both processes, P0 and P1, are executing in their critical sections at the same time.
What must be true if both are in their critical sections?
Each process Pi only enters its critical section when its while loop condition is false. Therefore, for both to be inside, the following must hold simultaneously:
- For P0 to be in: !(flag[1] && turn == 1) must be true.
- For P1 to be in: !(flag[0] && turn == 0) must be true.
Simplifying these, for both to be in their critical sections at once, at least one of the following compound conditions must be true:
- Case A: flag[1] == false (P1 isn't interested) OR turn == 0 (it's P0's turn).
- Case B: flag[0] == false (P0 isn't interested) OR turn == 1 (it's P1's turn).
However, if both processes are in their critical sections, they must have first set their own flags to true (flag[0] = true and flag[1] = true). So we know flag[0] == true and flag[1] == true. This eliminates the "flag false" parts of Cases A and B.
Therefore, for our initial assumption (both in CS) to hold with both flags true, we are left with:
- turn == 0 must be true (from Case A, since flag[1] == true).
- turn == 1 must also be true (from Case B, since flag[0] == true).
This is impossible. The shared variable turn cannot be both 0 and 1 at the same time. Our assumption leads to a contradiction.
Conclusion: It is impossible for both processes to be in their critical sections simultaneously. Mutual exclusion is preserved.
2. Proof of Progress¶
Progress requires that if no process is in its critical section and one or more want to enter, the selection of who enters next cannot be postponed indefinitely and must involve only those processes trying to enter.
Informal Proof:
A process Pi can be stuck, prevented from entering its critical section, only if it is trapped in its while loop. The loop condition is flag[j] && turn == j. So, for Pi to be stuck, two things must be true forever:
1. flag[j] == true (the other process Pj is interested).
2. turn == j (it is persistently the other process's turn).
But turn is only modified in the entry section (by the statement turn = j). If Pj is stuck in the loop, it is not modifying turn. If Pj is in its remainder section, flag[j] is false. If Pj enters its critical section, it will eventually exit and set flag[j] = false, breaking condition (1) for Pi. The only way for Pj to keep flag[j] == true and turn == j forever is if Pj itself is stuck in its while loop—which requires turn == i and flag[i] == true. This creates a cyclical dependency that cannot hold because turn is a single, shared variable.
If both processes try to enter concurrently, the race to set turn decides who goes first. The process that loses this race (whose turn = j is overwritten) finds turn == i and proceeds. The process that wins the race (sets turn last) finds turn == j and waits, but only until the first process finishes and clears its flag.
Therefore, a process wishing to enter will never be permanently blocked by a process in its remainder section, and deadlock is impossible. Progress is satisfied.
3. Proof of Bounded Waiting¶
Bounded Waiting requires that once a process Pi has expressed a desire to enter (by setting flag[i] = true), there is a limit to how many times the other process Pj can enter its critical section before Pi gets its turn.
Informal Proof:
Consider the point where Pi sets flag[i] = true. It then immediately sets turn = j (politely giving the turn away).
- Scenario 1: If Pj is not interested (flag[j] == false), Pi enters immediately. Waiting count = 0.
- Scenario 2: If Pj is also interested or becomes interested, we examine turn.
  - If turn == i at this point (meaning Pj set turn = i before Pi set turn = j), then Pi will enter immediately. Waiting count = 0.
  - If turn == j (meaning Pi's write of turn = j occurred last), then Pi finds flag[j] == true and turn == j and must spin while Pj proceeds. This is the only case where Pi must wait.
The Bound:
When Pj finishes its critical section, it sets flag[j] = false upon exiting. At this moment, the condition keeping Pi in its loop (flag[j] && turn == j) becomes false (because flag[j] is now false), and Pi can enter.
Crucially, for Pj to re-enter its critical section after this, it must go through its entry section again, executing turn = i. This flips turn to i, guaranteeing that Pi's waiting condition (turn == j) is now false, allowing Pi to proceed.
Thus, after Pi has set its flag to true, Pj can enter its critical section at most once before Pi is guaranteed entry. The bound is 1. Bounded waiting is satisfied.
Summary: Peterson's solution is a logically correct software algorithm that satisfies mutual exclusion, progress, and bounded waiting for two processes. Its primary educational value lies in demonstrating how to coordinate processes using shared variables, though it relies on assumptions about atomic memory operations that may not hold on real hardware.
Limitations on Modern Hardware¶
The Problem: Instruction Reordering¶
As noted, Peterson's solution may fail on modern computer architectures. The primary reason is that to maximize performance, processors (via out-of-order execution) and compilers (via code optimization) may reorder the sequence of read and write instructions.
- In Single-Threaded Programs: This reordering is safe and invisible. The hardware/compiler guarantees the final result is as if the instructions executed in program order. It's like rearranging the order of adding deposits and subtracting checks in your checkbook—the final balance is correct.
- In Multithreaded Programs with Shared Data: This reordering can become visible and disastrous. Other threads may observe memory updates in an order different from the program's original sequence, leading to inconsistent or unexpected program states.
Illustrative Example: A Broken Expectation¶
Consider two threads sharing these variables:
boolean flag = false;
int x = 0;
- Thread 1 executes:
  while (!flag)
      ;    // Wait until flag becomes true
  print x; // Then print x
- Thread 2 executes:
  x = 100;
  flag = true;
Expected Behavior: Thread 1 prints 100. Thread 2 first sets x to 100, then signals Thread 1 by setting flag to true.
How Reordering Breaks It:
- Compiler/CPU Reorders Thread 2: Since the two statements in Thread 2 are independent (neither depends on the other's result), the system might execute flag = true before x = 100. If this happens, Thread 1 could break out of its loop and print x before x is set to 100, resulting in output 0.
- CPU Reorders Thread 1: The processor could also reorder Thread 1's instructions, loading the value of x into a register before it checks the value of flag. Even if Thread 2 runs in perfect order, Thread 1 might have already cached the old value (0) of x before entering the loop, and then print that cached value.
Impact on Peterson's Solution¶
Go to Figure 6.4, which illustrates the failure.
Recall Peterson's entry section for process Pi:
flag[i] = true;
turn = j;
The Critical Assumption: The algorithm assumes these two store instructions execute and become visible to the other process in this exact order. However, if the hardware or compiler reorders them so that turn = j becomes visible before flag[i] = true, mutual exclusion can be violated.
Scenario from Figure 6.4:
1. P0 begins its entry section, but its two stores are reordered: turn = 1 becomes visible while flag[0] is not yet visible as true.
2. P1 executes its entry section: it sets flag[1] = true and turn = 0.
3. P1 now checks flag[0], which it still sees as false. Its wait condition therefore fails, and P1 enters its critical section.
4. P0's delayed store of flag[0] = true now completes. P0 checks its wait condition: flag[1] is true, but turn is 0 rather than 1, so the condition also fails and P0 enters its critical section as well.
Result: Both processes are now in their critical sections at the same time! Mutual exclusion is broken.
The Necessary Conclusion¶
Peterson's solution demonstrates correct synchronization logic, but it relies on a sequential consistency memory model that modern systems do not provide for performance reasons. To implement correct synchronization on real hardware, we need:
- Hardware Support: Mechanisms to prevent problematic instruction reordering and ensure certain memory updates become visible to other processors in a controlled manner.
- Proper Synchronization Tools: APIs that leverage this hardware support to provide correct and reliable mutual exclusion.
The following sections will explore these solutions, starting with low-level hardware features and building up to high-level programming abstractions.
6.4 Hardware Support for Synchronization¶
Introduction: Moving Beyond Pure Software¶
We've seen that pure software solutions like Peterson's algorithm are not reliable on modern hardware due to instruction reordering. To build correct and efficient synchronization, we need direct hardware support. This section covers three fundamental hardware instructions used to implement synchronization primitives. These low-level operations can be used directly by system programmers or as the building blocks for higher-level synchronization tools.
6.4.1 Memory Barriers (Memory Fences)¶
The Problem: Memory Models¶
The issue of instruction reordering is governed by a processor's memory model, which defines the guarantees it provides about when a write by one processor becomes visible to others.
Strongly Ordered Memory Model:
- Any modification to memory by one processor is immediately visible to all other processors.
- This simplifies reasoning but limits performance optimizations.
Weakly Ordered Memory Model (Common in Modern Architectures):
- Modifications to memory by one processor may not be immediately visible to others.
- Processors and compilers can aggressively reorder reads and writes for speed, as long as the single-threaded result is preserved.
Since memory models vary, programmers cannot make assumptions. To write correct concurrent code, we need a way to explicitly enforce ordering.
The Solution: Memory Barrier Instruction¶
A memory barrier (or memory fence) is a special hardware instruction that enforces ordering constraints on memory operations.
- What it does: When a memory barrier is executed, it guarantees that all load and store operations issued before the barrier are completed (i.e., their results are visible in main memory and to other processors) before any load or store operation issued after the barrier can begin.
- Effect: It prevents the hardware or compiler from reordering memory accesses across the barrier. It also ensures that writes performed on the current CPU are flushed to memory, making them visible to other CPUs.
Fixing the Earlier Example with Memory Barriers¶
Recall the example where Thread 2 sets x = 100; then flag = true;.
Without a barrier: The two stores could be reordered, causing Thread 1 to see flag = true while x is still 0.
Adding a memory barrier in Thread 2:
x = 100;
memory_barrier(); // Ensures x=100 is globally visible BEFORE flag=true
flag = true;
Now, the assignment to x is guaranteed to complete and be visible before the assignment to flag begins. This restores the logical order.
Adding a memory barrier in Thread 1:
Thread 1 waits for the flag: while (!flag) ;
Even after the loop ends, the CPU might have pre-fetched the old value of x into a register.
while (!flag)
;
memory_barrier(); // Ensures flag is truly loaded and checked before loading x
print x;
This barrier ensures the load of flag and the exit from the loop are fully completed before the load of x is performed, guaranteeing it sees the updated value.
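The two barriers above can be written concretely with C11's `atomic_thread_fence`; this is a hedged sketch (the function name `fence_demo` and the use of relaxed atomics plus explicit fences are choices of the sketch, not the text):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int x;             /* plain shared data                 */
static _Atomic int flag;  /* signal variable, relaxed + fences */

static void *thread2(void *arg) {
    (void)arg;
    x = 100;                                    /* write the data          */
    atomic_thread_fence(memory_order_release);  /* barrier: x = 100 must be
                                                   visible before flag      */
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
    return NULL;
}

int fence_demo(void) {
    pthread_t t;
    x = 0;
    atomic_store(&flag, 0);
    pthread_create(&t, NULL, thread2, NULL);
    while (!atomic_load_explicit(&flag, memory_order_relaxed))
        ;                                       /* Thread 1: wait for flag  */
    atomic_thread_fence(memory_order_acquire);  /* barrier: the flag load is
                                                   ordered before reading x */
    int v = x;                                  /* guaranteed to see 100    */
    pthread_join(t, NULL);
    return v;
}
```

The release fence before the store and the acquire fence after the load pair up to form exactly the two-sided ordering described above.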
Applying a Memory Barrier to Peterson's Solution¶
Referring back to Figure 6.4, the problem occurred because turn = j could become visible before flag[i] = true. To fix this, we insert a memory barrier between the two assignment statements in the entry section:
flag[i] = true;
memory_barrier(); // CRITICAL: Ensure flag[i]=true is visible before turn=j
turn = j;
while (flag[j] && turn == j)
;
// ... critical section ...
This barrier prevents the reordering that broke mutual exclusion, making Peterson's algorithm safe on architectures that provide this instruction.
Important Note: Memory barriers are very low-level, hardware-specific instructions. They are primarily used by kernel and system-level developers when implementing the core synchronization primitives (like locks) that higher-level applications will use. Application programmers typically use those higher-level primitives (e.g., mutexes) rather than memory barriers directly.
6.4.2 Hardware Instructions¶
Introduction: Atomic Hardware Primitives¶
To build efficient synchronization, we need hardware instructions that can perform a read, a test, and a write to memory as a single, indivisible (atomic) operation. These special instructions form the foundation upon which all higher-level synchronization tools (like mutexes and semaphores) are built.
Two classic abstract models for such instructions are test_and_set() and compare_and_swap(). They are implemented as single, uninterruptible machine instructions on modern CPUs (e.g., xchg on x86, ldrex/strex on ARM).
The test_and_set() Instruction¶
Go to Figure 6.5 for its definition. The test_and_set() instruction operates on a boolean value in memory.
Definition:
boolean test_and_set(boolean *target) {
boolean rv = *target; // Step 1: Read the old value
*target = true; // Step 2: Always set it to true (atomically with the read)
return rv; // Step 3: Return the OLD value
}
Key Property: Atomicity
The entire sequence—reading the old value and writing true—is executed atomically. If two CPUs execute test_and_set() on the same memory location simultaneously, the hardware guarantees they will be serialized (executed one after another in some order), and the combined read-write operations will not be interleaved.
Implementing Mutual Exclusion with test_and_set()¶
Concept:
We use a shared boolean lock variable, initialized to false.
- lock == false means the critical section is available.
- lock == true means the critical section is occupied.
The Algorithm (Go to Figure 6.6):
// Shared variable
boolean lock = false;
// Process Pi
do {
// ENTRY SECTION: Acquire the lock
while (test_and_set(&lock))
; // Busy-wait (spin) if lock was already true
// CRITICAL SECTION
// ... access shared data ...
// EXIT SECTION: Release the lock
lock = false;
// REMAINDER SECTION
} while (true);
How It Works Step-by-Step:¶
Entry (Acquiring the lock):
- A process calls test_and_set(&lock). Atomically, the instruction:
  a. Reads the current value of lock.
  b. Sets lock to true.
  c. Returns the old value it just read.
- Case 1: Lock was false (available). The instruction returns false and simultaneously sets lock to true. Since the while condition is false, the process exits the loop and enters its critical section. It has successfully acquired the lock.
- Case 2: Lock was true (busy). The instruction returns true (and also sets lock to true again, which is redundant). The while condition is true, so the process continues to spin (busy-wait) until the lock becomes free.
Critical Section:
- The process executes its critical section code. Because lock is true, any other process trying to enter will get test_and_set() returning true and will spin.
Exit (Releasing the lock):
- The process simply sets lock = false. This makes the lock available again.
- One of the waiting processes (spinning in its while loop) will now, on its next test_and_set() call, read this false, set lock to true, and enter.
Analysis Against the Three Requirements:¶
- Mutual Exclusion: YES. The atomicity of test_and_set() is crucial. It guarantees that exactly one process will see lock == false and enter. The first process to atomically set lock to true wins; all others find it already true and wait.
- Progress: YES. When a process leaves its critical section (sets lock = false), any process waiting in the entry section will immediately call test_and_set(). The one that executes it next will atomically capture the lock and enter. The decision is made by the hardware arbitration of whichever process executes the atomic instruction next; processes in the remainder section do not participate.
- Bounded Waiting: NO. This simple implementation does not guarantee bounded waiting. A process could be starved indefinitely if other processes repeatedly acquire the lock just before it gets a chance to execute its test_and_set() instruction. The waiting process has no priority; it is just one of many spinning contenders.
Summary: The test_and_set() instruction provides a simple, hardware-based way to enforce mutual exclusion with progress. However, the resulting spinlock causes busy-waiting and, in its basic form, does not prevent starvation. It's most useful in multiprocessor systems when the expected wait time for a lock is very short.
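As an illustrative sketch (not from the text), the Figure 6.6 lock maps directly onto C11's `atomic_exchange`, which has exactly test_and_set semantics: atomically store true and return the old value. Names such as `tas_demo` are invented:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static _Atomic int lock;    /* 0 = free, 1 = held */
static long shared_count;

/* test_and_set: atomically set the target to true, return the OLD value */
static int test_and_set(_Atomic int *target) {
    return atomic_exchange(target, 1);
}

static void *worker(void *arg) {
    (void)arg;
    for (int k = 0; k < 25000; k++) {
        while (test_and_set(&lock))
            ;                        /* spin: old value was 1, lock busy */
        shared_count++;              /* critical section                 */
        atomic_store(&lock, 0);      /* release                          */
    }
    return NULL;
}

long tas_demo(void) {
    pthread_t t[4];
    shared_count = 0;
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return shared_count;             /* 100000 when the lock works */
}
```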
The compare_and_swap() (CAS) Instruction¶
The compare_and_swap() (CAS) instruction is a more general and powerful atomic primitive than test_and_set(). It operates on three operands.
Definition (Go to Figure 6.7):
int compare_and_swap(int *value, int expected, int new_value) {
int temp = *value; // Step 1: Atomically read current value
if (*value == expected) // Step 2: Compare (still part of atomic op)
*value = new_value; // Step 3: Conditionally update
return temp; // Return the ORIGINAL value read in Step 1
}
How it works atomically:
1. It reads the current contents of memory location *value.
2. It compares this read value to the expected argument.
3. If and only if they match, it writes new_value into the location.
4. It always returns the original value it read in step 1, regardless of whether the swap occurred.
The entire sequence is executed as a single, uninterruptible hardware instruction.
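A small sketch (not from the text) reproduces Figure 6.7's interface on top of C11's `atomic_compare_exchange_strong`. The wrapper name `compare_and_swap` and the demo function are illustrative:

```c
#include <stdatomic.h>

/* compare_and_swap as in Figure 6.7: returns the ORIGINAL value, and
   updates *value only when it equals expected. On failure, C11 writes
   the observed value back into expected, so returning expected yields
   the original value in both the success and failure cases. */
static int compare_and_swap(_Atomic int *value, int expected, int new_value) {
    atomic_compare_exchange_strong(value, &expected, new_value);
    return expected;
}

int cas_demo(void) {
    _Atomic int v;
    atomic_init(&v, 0);
    int r1 = compare_and_swap(&v, 0, 1);  /* match: v becomes 1, returns 0  */
    int r2 = compare_and_swap(&v, 0, 7);  /* mismatch: v stays 1, returns 1 */
    return r1 * 100 + r2 * 10 + atomic_load(&v);  /* 0*100 + 1*10 + 1 = 11  */
}
```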
Basic Mutual Exclusion Using CAS¶
Concept: Use a shared integer lock, initialized to 0.
- lock == 0 means the critical section is available.
- lock == 1 means the critical section is occupied.
The Algorithm (Go to Figure 6.8):
// Shared variable
int lock = 0;
// Process Pi
while (true) {
// ENTRY SECTION
while (compare_and_swap(&lock, 0, 1) != 0)
; // Spin while the returned original value is NOT 0
// CRITICAL SECTION
// ... access shared data ...
// EXIT SECTION
lock = 0;
// REMAINDER SECTION
}
Walkthrough:
- A process tries to enter by calling compare_and_swap(&lock, 0, 1).
  - If lock is currently 0 (the expected value), the CAS atomically swaps it to 1 and returns the original value 0. The process exits the spin loop and enters its critical section.
  - If lock is currently 1, the CAS finds lock (1) != expected (0), so it does not change lock (it stays 1) but returns the original value 1. Since the return value is != 0, the process continues to spin.
- Upon exit, the process simply resets lock = 0.
Analysis: This simple CAS lock has the same properties as the simple test_and_set() lock: it ensures Mutual Exclusion and Progress, but does NOT satisfy Bounded Waiting (starvation is possible).
Achieving Bounded Waiting with CAS¶
Go to Figure 6.9 for a more sophisticated algorithm that uses CAS to satisfy all three requirements, including bounded waiting, for n processes.
Shared Data Structures:
boolean waiting[n]; // Each process has its own flag, initialized to false
int lock = 0; // Global lock, initialized to 0
Algorithm for Process Pi:
while (true) {
// ENTRY SECTION
waiting[i] = true; // I'm waiting to enter
int key = 1;
// Spin until: I'm not waiting OR I successfully acquire the lock
while (waiting[i] && key == 1)
key = compare_and_swap(&lock, 0, 1); // Try to atomically grab the lock
waiting[i] = false; // I'm no longer waiting (I'm entering)
// CRITICAL SECTION
// ... access shared data ...
// EXIT SECTION (Finds next waiter in round-robin order)
int j = (i + 1) % n; // Start scanning after me
while (j != i && !waiting[j]) // Find the next waiting process
j = (j + 1) % n;
if (j == i) // Case 1: No one is waiting
lock = 0; // Just release the global lock
else // Case 2: Process j is waiting
waiting[j] = false; // Directly pass the lock to j
// REMAINDER SECTION
}
How It Satisfies the Requirements:
- Mutual Exclusion: The lock variable ensures it. Only the first process to succeed in the CAS sets lock = 1. That process proceeds only when waiting[i] becomes false (which it sets itself after acquiring the lock). Others spin because their key remains 1 (CAS fails while lock == 1).
- Progress: When a process exits:
  - If no one is waiting, it sets lock = 0, allowing the next process that arrives to acquire it via CAS.
  - If someone is waiting, it directly sets that waiter's flag to false. This immediately allows that specific waiting process to exit its spin loop (waiting[j] is now false) and enter the critical section. The lock transfer is explicit and immediate.
- Bounded Waiting (The Key Innovation): The exit section implements a round-robin scan. A departing process searches the waiting array in cyclic order starting from the next process and grants the lock to the first waiting process it finds. Therefore, any waiting process will be granted the lock no later than after every other process has entered at most once. The bound is n - 1 turns.
Summary: The compare_and_swap() instruction is a versatile building block. It can be used to implement simple spinlocks or, with additional data structures like a waiting array, to build sophisticated, fair locks that prevent starvation. This demonstrates how hardware atomic instructions serve as the foundation for all higher-level synchronization constructs.
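For the curious, the Figure 6.9 algorithm can be exercised with a runnable C11 sketch. The use of sequentially consistent atomics for lock and the waiting array, and names like `bounded_demo`, are assumptions of the sketch:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define N 4                       /* number of competing threads */

static _Atomic int waiting[N];
static _Atomic int lock_var;      /* 0 = free, 1 = held */
static long shared_count;

static int compare_and_swap(_Atomic int *v, int expected, int newval) {
    atomic_compare_exchange_strong(v, &expected, newval);
    return expected;              /* original value, as in Figure 6.7 */
}

static void *worker(void *arg) {
    int i = (int)(long)arg;
    for (int r = 0; r < 25000; r++) {
        /* ENTRY SECTION */
        atomic_store(&waiting[i], 1);
        int key = 1;
        while (atomic_load(&waiting[i]) && key == 1)
            key = compare_and_swap(&lock_var, 0, 1);
        atomic_store(&waiting[i], 0);
        /* CRITICAL SECTION */
        shared_count++;
        /* EXIT SECTION: round-robin scan for the next waiter */
        int j = (i + 1) % N;
        while (j != i && !atomic_load(&waiting[j]))
            j = (j + 1) % N;
        if (j == i)
            atomic_store(&lock_var, 0);    /* no one waiting: release     */
        else
            atomic_store(&waiting[j], 0);  /* hand the lock directly to Pj */
    }
    return NULL;
}

long bounded_demo(void) {
    pthread_t t[N];
    shared_count = 0;
    for (long i = 0; i < N; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < N; i++)
        pthread_join(t[i], NULL);
    return shared_count;                   /* N * 25000 = 100000 */
}
```

Note that in the hand-off case the global lock stays at 1; ownership passes to Pj purely by clearing waiting[j].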
Making CAS Atomic in Hardware¶
Implementation on Intel x86:
The CAS functionality is provided by the cmpxchg (compare and exchange) assembly instruction. To ensure it is atomic across multiple CPUs, it is preceded by the lock prefix.
Format:
lock cmpxchg <destination operand>, <source operand>
The lock prefix locks the system bus (or uses a cache-coherence protocol) for the duration of the operation, preventing any other CPU from accessing the destination memory location. This guarantees the entire read-modify-write cycle is seen by all processors as atomic.
Summary: CAS is a versatile atomic primitive. Its basic use creates a simple spinlock. With additional logic (like a waiting array and round-robin scanning), it can be used to build sophisticated, fair synchronization mechanisms that satisfy all critical-section requirements.
6.4.3 Atomic Variables¶
Introduction: A Higher-Level Abstraction¶
The compare_and_swap() (CAS) instruction is powerful but low-level. Programmers rarely use it directly. Instead, it serves as the fundamental building block for more convenient synchronization tools. The first such tool we examine is the atomic variable.
Purpose: An atomic variable provides a set of atomic (indivisible) operations on basic data types (integers, booleans). Its primary use is to prevent race conditions that involve updates to a single shared variable, like incrementing a counter (count++).
Recall from Section 6.1: Our producer-consumer example failed because count++ was not atomic—it involved separate read, modify, and write steps. An atomic integer would solve that specific race condition on the count variable itself.
How Atomic Variables Work¶
Systems that support atomic variables provide:
- Special Atomic Data Types (e.g., atomic_int, atomic_bool).
- Associated Functions to operate on them (e.g., increment(), load(), store(), compare_exchange_strong()).
Under the Hood: These functions are implemented using hardware atomic instructions like compare_and_swap() (CAS).
Example: Implementing increment() with CAS
The text shows how to build an atomic increment operation from CAS. The logic is a common pattern known as a CAS loop or optimistic concurrency control.
void increment(atomic_int *v) {
int temp;
do {
temp = *v; // Read the current value of the atomic variable
// Try to swap. If *v hasn't changed since we read it (== temp),
// set it to temp+1. If it has changed, the swap fails, and we retry.
} while (temp != compare_and_swap(v, temp, temp + 1));
}
How the CAS Loop Works:
1. Read the current value of the shared atomic variable into a local temp.
2. Use CAS to attempt an update: "If *v still equals my temp, set it to temp + 1."
3. CAS atomically checks and updates. If it succeeds (returns the original value, which equals temp), the loop ends; the increment is done.
4. If CAS fails (returns a different value because another thread changed *v in the meantime), the loop retries from step 1 with the new value.
This ensures the increment happens atomically, without a separate lock. The process is "optimistic": it proceeds assuming no conflict, and only retries if a conflict is detected by the CAS.
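The CAS loop can be tried directly in C11, where `atomic_compare_exchange_weak` plays the role of compare_and_swap (it may fail spuriously, which the retry loop absorbs). This is a sketch with invented names, not the text's code:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static _Atomic int v;

/* lock-free increment: retry until the CAS confirms no one interfered */
static void increment(_Atomic int *p) {
    int temp;
    do {
        temp = atomic_load(p);    /* step 1: read the current value */
    } while (!atomic_compare_exchange_weak(p, &temp, temp + 1));
}

static void *worker(void *arg) {
    (void)arg;
    for (int k = 0; k < 50000; k++)
        increment(&v);
    return NULL;
}

int cas_increment_demo(void) {
    pthread_t t[2];
    atomic_store(&v, 0);
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return atomic_load(&v);       /* 100000: no update is ever lost */
}
```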
Important Limitations of Atomic Variables¶
Critical Caveat: Atomic variables do NOT solve all synchronization problems. They only guarantee atomicity for operations on that single variable. They do not provide mutual exclusion for compound operations or coordination between multiple variables.
Illustration with the Bounded-Buffer Problem:
Suppose we correctly make count an atomic integer. This ensures that count++ and count-- are atomic. However, the race condition is not fully solved.
Problem Scenario:
1. Buffer is empty. count == 0.
2. Two consumers, C1 and C2, both execute their while (count == 0) loop.
3. A producer adds one item and performs increment(&count). count is now atomically set to 1.
4. Both C1 and C2 can now see count == 1 and exit their while loops simultaneously.
5. Both proceed to consume the same single item, leading to data corruption or an attempt to consume from an empty buffer.
The Root Cause: The atomic variable protects the update of count, but the check-then-act sequence (while (count == 0) followed by count-- and buffer access) is not atomic as a whole. The consumers make a decision based on count and then act on it, but another thread can intervene between the check and the act.
When to Use Atomic Variables¶
Atomic variables are ideal for specific, simple scenarios:
- Shared counters (e.g., statistics, sequence number generators).
- Status flags (e.g., a boolean shutdown_requested flag).
- Simple bitmasks.
They are not sufficient for coordinating access to complex data structures (like a bounded buffer queue) or for enforcing invariants that depend on multiple shared variables. For those, you need the more robust synchronization tools discussed next: mutex locks, semaphores, and monitors, which can guard entire sections of code and coordinate actions between multiple threads. Atomic variables are a valuable tool in the concurrency toolkit, but they are just one part of the solution.
6.5 Mutex Locks¶
Introduction: Higher-Level Synchronization Tools¶
The hardware instructions (test_and_set(), compare_and_swap()) are powerful but low-level and complex to use directly. To make synchronization accessible to application programmers, operating systems provide higher-level software abstractions. The simplest and most fundamental of these is the mutex lock (short for "mutual exclusion lock").
Purpose: A mutex lock is used to protect critical sections and thereby prevent race conditions. The rule is simple:
- A process must acquire the lock before entering its critical section.
- It releases the lock after exiting its critical section.
General Structure (Go to Figure 6.10):
while (true) {
acquire_lock(); // ENTRY SECTION: Get the lock
// CRITICAL SECTION
release_lock(); // EXIT SECTION: Release the lock
// REMAINDER SECTION
}
Basic Mutex Lock Implementation¶
A mutex lock is represented internally by a boolean variable, typically named available (or locked).
- available == true: The lock is free; no one holds it.
- available == false: The lock is busy, held by some process.
The acquire() function:
acquire() {
while (!available)
; /* BUSY WAIT: spin until available becomes true */
available = false; // Atomically take the lock
}
The release() function:
release() {
available = true; // Atomically release the lock
}
Critical Requirement: The entire acquire() function (the check of available and the subsequent setting to false) must be executed atomically. If not, two processes could both see available == true simultaneously and both believe they acquired the lock. This atomicity is achieved by implementing these functions using the hardware atomic instructions (like compare_and_swap()) discussed in Section 6.4.
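One way to meet this atomicity requirement, sketched here in C11 rather than taken from the text, is to implement acquire() as a CAS that checks available == true and sets it to false in a single atomic step (the name mutex_demo is invented):

```c
#include <stdatomic.h>

static _Atomic int available = 1;   /* 1 = lock free, 0 = held */

void acquire(void) {
    int expected;
    do {
        expected = 1;               /* may take the lock only if it is free */
    } while (!atomic_compare_exchange_weak(&available, &expected, 0));
    /* the check of available and the set to "held" happen as ONE atomic
       step, so two callers can never both observe the lock as free */
}

void release(void) {
    atomic_store(&available, 1);
}

int mutex_demo(void) {
    acquire();                                /* lock taken */
    int held = atomic_load(&available);       /* 0 while held */
    release();                                /* lock freed */
    return held * 10 + atomic_load(&available);   /* 0*10 + 1 = 1 */
}
```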
The Busy-Waiting Problem and Spinlocks¶
The simple mutex implementation shown above has a major drawback: busy waiting.
- What is Busy Waiting? If a process finds the lock unavailable (available == false), it enters a tight loop (while (!available) ;), continuously checking the lock's status. This is called spinning.
- Why it's Bad (especially on single-core systems): Busy waiting wastes CPU cycles. While one process holds the lock in its critical section, another process (or other ready processes) could be using the CPU productively. On a single CPU, a spinning process occupies the very processor the lock holder needs in order to finish and release the lock, potentially leading to a deadlock-like stall.
Because of this spinning behavior, a mutex lock implemented this way is specifically called a spinlock.
When Are Spinlocks Actually Useful?¶
Despite the waste, spinlocks have a crucial advantage: They avoid a context switch.
- Context Switch Cost: Switching a process off the CPU (saving its state, loading another's) is a relatively expensive operation.
- The Trade-off: If the expected wait time for a lock is very short (shorter than the time it would take to do a context switch), then it is more efficient to have the waiting process spin briefly than to incur the overhead of putting it to sleep and later waking it up.
- Ideal Use Case: Multicore systems, where one thread can spin on Core A while the lock holder executes its short critical section on Core B. The spinning thread will see the lock released almost immediately without any expensive OS intervention.
Conclusion: Spinlocks are widely used in operating system kernels for protecting short-duration critical sections, especially on multiprocessor systems. For user-level applications where hold times may be longer, alternative mechanisms that avoid busy waiting (like those putting the process to sleep) are preferred.
Looking Ahead¶
- The next section (6.6) introduces synchronization tools (like semaphores) that avoid busy waiting by putting waiting processes into a sleep state.
- Chapter 7 will show practical applications of mutex locks to solve classical synchronization problems.
- Mutex locks and spinlocks are foundational constructs used in real systems like Linux, Windows, and the Pthreads (POSIX threads) library.
In summary: A mutex lock is a software abstraction that provides mutual exclusion. Its simplest implementation is a spinlock, which is efficient for short waits on multiprocessors but wasteful for long waits. The atomicity of the lock operations is guaranteed by underlying hardware instructions.
Lock Contention¶
Contention refers to competition for a lock.
- Uncontended Lock: A lock is uncontended when a thread tries to acquire() it and it is immediately available. This is the ideal, low-overhead case.
- Contended Lock: A lock is contended when a thread tries to acquire() it and finds it already held, forcing the thread to wait (either by spinning or blocking).
  - Low Contention: Few threads compete for the lock.
  - High Contention: Many threads frequently compete for the lock. High contention is a major source of performance degradation in concurrent programs, as threads spend most of their time waiting.
Defining "Short Duration" for Spinlocks¶
The guideline for using a spinlock is that the lock should be held for a "short duration." Quantitatively, this means: The expected time the lock will be held should be LESS than the time it would take to perform two context switches. If the hold time is longer, the wasted CPU cycles from spinning outweigh the context-switch overhead, and a blocking lock (which puts waiting threads to sleep) is a better choice. The next section (6.6) explores locks that avoid busy-waiting.
6.6 Semaphores¶
Introduction: A More Robust Synchronization Tool¶
Mutex locks enforce simple mutual exclusion. Semaphores are a more general and powerful synchronization primitive introduced by Edsger Dijkstra. A semaphore is a shared integer variable that can only be manipulated via two atomic operations.
Definitions (using Dijkstra's original terminology):
- wait(S) (also called P(S), from the Dutch proberen, "to test"): Decrements the semaphore value. If the value becomes negative, the process executing wait() must wait.
- signal(S) (also called V(S), from the Dutch verhogen, "to increment"): Increments the semaphore value. If any processes are waiting, one may be resumed.
Crucial Requirement: The operations on the integer S—especially the test and decrement in wait()—must be executed atomically. No two processes can execute wait() or signal() on the same semaphore simultaneously. Their implementation relies on lower-level hardware atomic instructions (see Section 6.6.2).
6.6.1 Semaphore Usage¶
Types of Semaphores¶
- Binary Semaphore:
- Integer value range: 0 or 1.
- Functionally equivalent to a mutex lock.
- wait() acquires the lock; signal() releases it.
- Counting Semaphore:
- Integer value can range over any non-negative integer.
- Used to manage access to a pool of identical resources with a finite number of instances.
Using Counting Semaphores for Resource Management¶
Scenario: A system has N identical instances of a resource (e.g., 5 tape drives, 3 printer ports).
Implementation:
- Initialize the semaphore S to N (the total number of available resources).
- Each process must:
  - Call wait(S) before using a resource. This atomically decrements S. If S > 0, a resource is available and the process proceeds. If S == 0, all resources are in use, and the process must block (or busy-wait, in the simple definition) until one becomes free.
  - Use the resource (its critical section).
  - Call signal(S) after finishing with the resource. This atomically increments S, signaling that a resource has been returned to the pool.
Example: With N=3, S starts at 3. The first three processes calling wait(S) will get S=2,1,0 and proceed. A fourth process calling wait(S) will find S <= 0 and must wait. When a process calls signal(S), S becomes 1, and a waiting process can be awakened.
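The same N=3 scenario can be traced with POSIX semaphores, using sem_trywait to observe (rather than block on) the "all resources in use" state. This is a sketch assuming a POSIX system; the function name run_pool_demo is illustrative.

```c
#include <semaphore.h>
#include <errno.h>

/* Counting semaphore managing N=3 identical resource instances. */
int run_pool_demo(void) {
    sem_t pool;
    sem_init(&pool, 0, 3);        /* S starts at 3 */

    sem_wait(&pool);              /* S: 3 -> 2 */
    sem_wait(&pool);              /* S: 2 -> 1 */
    sem_wait(&pool);              /* S: 1 -> 0, all instances in use */

    /* A fourth wait() would block; sem_trywait reports that instead. */
    int would_block = (sem_trywait(&pool) == -1 && errno == EAGAIN);

    sem_post(&pool);              /* return one instance: S: 0 -> 1 */
    int ok_after_release = (sem_trywait(&pool) == 0);  /* now succeeds */

    sem_destroy(&pool);
    return would_block && ok_after_release;
}
```

sem_trywait stands in for the fourth process: it fails with EAGAIN while S is 0 and succeeds once a resource has been returned.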
Using Semaphores for Process Synchronization (Ordering)¶
Semaphores can enforce execution order between processes, not just mutual exclusion.
Problem: Ensure statement S2 in process P2 executes only after statement S1 in process P1 has completed.
Solution:
- Create a semaphore synch and initialize it to 0.
- Process P1:

S1;            // Execute the first statement
signal(synch); // Signal that S1 is done (increments synch to 1)

- Process P2:

wait(synch); // Wait for the signal from P1. If synch == 0, block.
S2;          // Execute only after P1's signal
How it works:
- If P2 reaches wait(synch) before P1 executes signal(synch), the semaphore value is 0. P2 will block (wait).
- When P1 later executes S1 followed by signal(synch), it increments synch to 1. This unblocks P2, allowing S2 to run.
- If P1 signals first, synch becomes 1. When P2 later calls wait(synch), it finds synch > 0, decrements it to 0, and proceeds immediately without blocking.
This pattern is fundamental for coordinating tasks (e.g., producer-consumer, where the consumer must wait for the producer to create data).
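The P1/P2 ordering pattern maps directly onto POSIX semaphores. In this sketch the waiter is deliberately started first, yet it always observes S1's result; the names (p1, p2, run_order_demo, the value 42) are illustrative.

```c
#include <stddef.h>
#include <semaphore.h>
#include <pthread.h>

static sem_t synch;        /* initialized to 0: "S1 not yet done" */
static int s1_result = 0;

static void *p1(void *arg) {
    (void)arg;
    s1_result = 42;        /* S1: produce a value */
    sem_post(&synch);      /* signal(synch): S1 is done */
    return NULL;
}

static void *p2(void *arg) {
    sem_wait(&synch);         /* wait(synch): block until P1 signals */
    *(int *)arg = s1_result;  /* S2: guaranteed to see S1's effect */
    return NULL;
}

int run_order_demo(void) {
    int seen = 0;
    pthread_t t1, t2;
    sem_init(&synch, 0, 0);
    pthread_create(&t2, NULL, p2, &seen);  /* start the waiter first */
    pthread_create(&t1, NULL, p1, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    sem_destroy(&synch);
    return seen;           /* always 42, regardless of scheduling */
}
```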
6.6.2 Semaphore Implementation¶
Eliminating Busy-Waiting: The Blocking Semaphore¶
The simple definitions of wait() and signal() presented earlier suffered from busy-waiting, just like the basic spinlock. To solve this, we modify the implementation so that a process blocks (suspends itself) when it must wait, allowing the CPU to be used by other processes.
The Semaphore Data Structure¶
A semaphore is now defined as a record (struct) containing:
- value: An integer (which can now be negative).
- list: A queue (linked list) of processes (or threads) blocked and waiting on this semaphore.
typedef struct {
int value;
struct process *list; // Pointer to the head of a waiting queue
} semaphore;
Important Interpretation of value:
- value >= 0: The number of available resource instances (for counting semaphores), or an indication that the lock is free (for binary semaphores).
- value < 0: The magnitude (absolute value) of value indicates the number of processes currently blocked in the semaphore's waiting queue. (e.g., value = -3 means 3 processes are waiting.)
Blocking Implementation of wait()¶
wait(semaphore *S) {
S->value--; // 1. Decrement the value first.
if (S->value < 0) { // 2. Check if we need to block.
// No resources available or lock is busy.
add this process to S->list; // 3. Enqueue itself on the wait list.
sleep(); // 4. Block: Process state -> WAITING.
// CPU scheduler picks a new process.
}
}
Step-by-Step Logic:
- Atomically decrement value. This "takes a ticket" or "attempts to acquire."
- Check the result:
  - If value >= 0 after the decrement, the process successfully acquired a resource/the lock and proceeds to its critical section immediately.
  - If value < 0 after the decrement, no resource was available. The process must block.
- Blocking: The process adds its own Process Control Block (PCB) to the semaphore's waiting queue and calls sleep(). This is a system call that changes the process state from RUNNING to WAITING (or BLOCKED) and invokes the scheduler to run a different process.
Blocking Implementation of signal()¶
signal(semaphore *S) {
S->value++; // 1. Increment the value.
if (S->value <= 0) { // 2. Check if processes are waiting.
// There are blocked processes (value was -1 or less before increment).
remove a process P from S->list; // 3. Dequeue a waiting process.
wakeup(P); // 4. Wake it: Process state -> READY.
}
}
Step-by-Step Logic:
- Atomically increment value. This "returns a resource" or "releases the lock."
- Check if processes are waiting:
  - If value <= 0, then before the increment value was <= -1, so there is at least one process waiting. The increment only raised the count (e.g., from -2 to -1); it is still non-positive, so waiters remain.
  - If value > 0 after the increment, no processes were waiting (the queue was empty).
- Wake-up: If processes are waiting, remove one process from the waiting queue (the choice depends on queue policy, e.g., FIFO for fairness). Call wakeup(P), a system call that moves that process from the WAITING state to the READY state, placing it in the ready queue for the scheduler.
Key Implementation Details¶
Atomicity of Operations: The entire wait() and signal() functions (the decrement/increment plus the check) must be atomic. This is itself a critical-section problem.
- On a uniprocessor: Can be implemented by disabling interrupts during the operation to prevent context switches.
- On a multiprocessor (SMP): Disabling interrupts on all cores is impractical. The atomicity must be ensured using lower-level spinlocks or atomic hardware instructions (like compare_and_swap()) within the wait()/signal() implementation itself. This means short busy-waiting occurs inside the semaphore implementation, but it is confined to a few instructions.
Waiting Queue Management: The waiting list is typically a FIFO queue (to ensure bounded waiting and prevent starvation), but other policies are possible. The queue is managed via pointers in the Process Control Block (PCB).
Why This is Efficient: Minimizing Busy-Waiting¶
We have not eliminated busy-waiting entirely, but we have contained and minimized it.
- Busy-waiting is now confined to the very short critical sections inside wait() and signal() themselves, which are protected by a low-level spinlock or interrupt disabling. These sections are only a few instructions long.
- Application processes no longer busy-wait. They block and free the CPU if the semaphore is unavailable. This is crucial when the wait time is long (minutes/hours) or critical sections are large.
Contrast with Spinlocks: A spinlock forces every waiting process to busy-wait for the entire duration of the holder's critical section. With blocking semaphores, only the internal implementation lock might involve a brief spin, while user processes sleep peacefully.
This efficient, blocking implementation is the standard semaphore used in practice.
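The bookkeeping above can be checked with a tiny single-threaded simulation. This is a sketch of the counting logic only: sleep()/wakeup() are modeled by a waiting counter rather than real blocking, and the type and function names are illustrative.

```c
/* Single-threaded simulation of the blocking-semaphore bookkeeping. */
typedef struct {
    int value;
    int waiting;   /* stands in for the length of S->list */
} sim_semaphore;

void sim_wait(sim_semaphore *s) {
    s->value--;
    if (s->value < 0)
        s->waiting++;      /* the process enqueues itself and sleeps */
}

void sim_signal(sim_semaphore *s) {
    s->value++;
    if (s->value <= 0)
        s->waiting--;      /* one blocked process is woken */
}
```

Starting from value 1, four sim_wait calls leave value at -3 with three processes recorded as waiting, matching the interpretation "value = -3 means 3 processes are waiting."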
6.7 Monitors¶
The Problem: Correct but Error-Prone Synchronization¶
While semaphores and mutex locks are powerful tools, they are low-level and error-prone. Their correct use relies entirely on the programmer's discipline to:
- Call wait() (or acquire()) before entering a critical section.
- Call signal() (or release()) after exiting the critical section.
- Never make a mistake in the order or number of these calls.
Because these operations are called explicitly from the general code, it's easy to make subtle, hard-to-detect bugs.
Illustration of Common Semaphore/Mutex Errors¶
Consider the standard pattern for mutual exclusion using a binary semaphore mutex initialized to 1:
Correct Pattern:
wait(mutex); // ENTRY
// CRITICAL SECTION
signal(mutex); // EXIT
Potential Catastrophic Errors:
Reversed Order:

signal(mutex); // WRONG: Signals without holding!
// CRITICAL SECTION
wait(mutex);   // WRONG: Tries to acquire after!

Result: Mutual exclusion is completely violated. Multiple processes can enter the critical section concurrently. This bug may only surface rarely under specific interleavings, making it a heisenbug—hard to reproduce and detect.

Double wait() (Deadlock):

wait(mutex);
// CRITICAL SECTION
wait(mutex);   // WRONG: Waits again on the same lock!

Result: The process deadlocks itself permanently on the second wait(), because it is waiting for a lock it already holds and will never release.

Omitted Operations:
- Omit wait(mutex): ➜ Mutual exclusion is violated.
- Omit signal(mutex): ➜ The lock is never released, causing all other processes to permanently block (deadlock) when they try to acquire it.
- Omit both: ➜ Chaos (no synchronization at all).
These errors can arise from simple typos, copy-paste mistakes, complex control flow (like early returns in the critical section), or exception handling.
The Solution Idea: High-Level Language Constructs¶
The fundamental issue is that synchronization is separate from the resource it protects. The compiler and runtime system have no way of knowing which shared data a particular mutex is supposed to guard.
The Strategy: To prevent these errors, we can embed synchronization directly into the programming language. The language construct should:
- Encapsulate the shared data and the synchronization that protects it into a single, indivisible unit.
- Automatically enforce the correct acquisition and release of locks, making it impossible for a programmer to forget or misorder the calls.
Introducing the Monitor¶
The monitor is a high-level synchronization abstraction provided by certain programming languages (like Java synchronized methods, or Mesa/Cedar). It directly addresses the problems above.
A monitor is a collection of:
- Shared Data Variables: The resources or state that need to be protected.
- Procedures (or Methods) that operate on that shared data.
- An Implicit Lock (Mutex): Every monitor has an associated lock, but it's managed automatically.
- Initialization Code: To set up the shared data.
Key Enforcement Rule (The Big Idea):
- Only one process/thread can be active (executing any procedure) inside the monitor at any given time. The language runtime automatically acquires the monitor's lock when a thread enters any of its procedures and releases it when the thread leaves.
- This makes the entire set of procedures implicitly mutually exclusive. The programmer cannot forget to acquire or release the lock—it's handled by the language.
Result: The simple, error-prone pattern:
acquire(lock);
modify_shared_data();
release(lock);
is replaced by a structured, safer pattern:
monitor Resource {
private shared_data;
public procedure modify() {
// Compiler/runtime automatically acquires lock here.
// ... safely modify shared_data ...
// Lock is automatically released upon return.
}
}
By bundling the data and the operations together and letting the language manage the lock, monitors eliminate entire classes of synchronization bugs. However, monitors introduce a new need: a way for threads to wait for specific conditions (e.g., a consumer waiting for a buffer to be non-empty). This leads to the companion concept of condition variables, which we will explore next.
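The safer pattern can be approximated even in a language without monitors: bundle the shared data and its lock in one struct, and expose only functions that do the locking themselves. A minimal C sketch with pthreads (all names are illustrative; unlike a true monitor, C cannot stop a caller from reaching into the struct directly):

```c
#include <stddef.h>
#include <pthread.h>

/* Monitor-like encapsulation: data + lock travel together, and the
   only access path is through functions that lock and unlock. */
typedef struct {
    pthread_mutex_t lock;   /* the monitor's implicit lock, made explicit */
    int shared_data;
} resource_monitor;

void monitor_init(resource_monitor *m) {
    pthread_mutex_init(&m->lock, NULL);
    m->shared_data = 0;
}

void monitor_modify(resource_monitor *m) {
    pthread_mutex_lock(&m->lock);    /* "entry": lock acquired for the caller */
    m->shared_data++;                /* safely modify shared state */
    pthread_mutex_unlock(&m->lock);  /* "exit": lock released for the caller */
}

int monitor_read(resource_monitor *m) {
    pthread_mutex_lock(&m->lock);
    int v = m->shared_data;
    pthread_mutex_unlock(&m->lock);
    return v;
}
```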
6.7.1 Monitor Usage¶
Monitors as an Abstract Data Type (ADT)¶
A monitor is essentially a specialized Abstract Data Type (ADT).
- A regular ADT encapsulates data and the operations (functions) on that data, hiding the implementation details.
- A monitor type does this and additionally guarantees mutual exclusion for all its operations. It's an ADT with built-in synchronization.
Syntax (see Figure 6.11):
The monitor defines:
- Private/Shared Variables: The internal state that needs protection.
- Public Procedures/Functions (P1, P2, ..., Pn): The only way to access or modify the shared variables. These are the monitor's interface.
- Initialization Code: Runs once to set up the initial state.
Crucial Access Rule:
- The shared variables are only accessible from within the monitor's procedures. External processes cannot touch them directly.
- The monitor's procedures can access only variables declared locally within the monitor and their own formal parameters.
Automatic Mutual Exclusion: The Core Benefit¶
See Figure 6.12 for a schematic view. The monitor runtime enforces a fundamental rule: only one process/thread can be actively executing inside any of the monitor's procedures at any time.
How it works:
- When a process calls a monitor procedure, the language runtime implicitly (and atomically) acquires the monitor's lock before the procedure code runs.
- When the process returns from the procedure, the runtime implicitly releases the lock.
- If a second process calls a monitor procedure while the first is inside, it is blocked and placed on an entry queue until the first process leaves and the lock is free.
Result: The programmer does not write any acquire() or release() calls. The mutual exclusion is automatic and impossible to forget, eliminating a major source of bugs.
The Need for Condition Variables¶
Automatic mutual exclusion isn't enough. Often, a thread needs to wait for a specific condition to become true inside the monitor (e.g., a consumer waiting for a buffer to be non-empty). It must release the monitor lock while waiting so other threads (like a producer) can enter and make the condition true.
This is where condition variables come in. A condition variable is a queue associated with a logical condition.
Declaration:
condition x, y;
Operations:
- x.wait(): When a thread calls x.wait(), it:
  - Releases the monitor lock (so others can enter).
  - Blocks itself and is placed in the queue for condition x.
- x.signal(): When a thread calls x.signal():
  - If there are threads waiting on condition x, one is removed from the queue and resumed.
  - If the queue is empty, the call has no effect (unlike a semaphore signal(), which always changes state).
The Signaling Dilemma: What Happens After a signal()?¶
Scenario: Process P is executing inside the monitor. It calls x.signal(), which wakes up a suspended process Q. Now two processes (P and Q) are logically active inside the same monitor, which violates the "only one active" rule. How is this resolved?
Two main semantic choices exist:
Signal and Wait (Hoare Semantics):
- The signaling process P immediately blocks itself and waits (perhaps on a separate, temporary queue).
- The awakened process Q takes over the monitor lock immediately and runs.
- When Q later exits the monitor or waits on another condition, P is re-activated and reclaims the lock.
- Advantage: When Q resumes, the condition it was waiting for is guaranteed to be true, because P has not had a chance to change the state after signaling.

Signal and Continue (Mesa Semantics):
- The signaling process P continues executing inside the monitor.
- The awakened process Q is moved from the condition queue to the entry queue and must re-acquire the monitor lock when it becomes free (i.e., after P and any other queued processes leave).
- Implication: By the time Q finally re-enters the monitor, the condition it was waiting for may no longer be true (because P or another process could have changed the state). Therefore, Q must re-check the condition in a loop (while (condition_not_met) x.wait();).
- This is the more common approach (used in Java and C#) because it is easier to implement and often more efficient.

A Compromise (Brinch Hansen Semantics):
- The signaling process P must leave the monitor immediately after the signal() call (i.e., signal() must be its last operation in the procedure).
- Then Q can be resumed immediately. This is a restricted form of Signal and Wait.
Monitor with Condition Variables Schematic¶
See Figure 6.13, which shows the full picture:
- Entry Queue: For threads waiting to enter the monitor.
- Condition Queues (x, y): Separate queues for threads that have called x.wait() or y.wait().
- Only one thread is active inside the monitor rectangle at any time.
Real-World Adoption¶
Monitors, often with Mesa semantics, are widely implemented:
- Java: Every object has an intrinsic monitor lock. synchronized methods/blocks provide mutual exclusion. The wait(), notify(), and notifyAll() methods of the Object class correspond to wait() and signal() on condition variables.
- C#: The lock statement and the Monitor.Wait(), Monitor.Pulse() (signal), and Monitor.PulseAll() methods.
- Other Languages: Some languages sidestep shared-state synchronization entirely; Erlang, for example, uses an actor-based concurrency model with message passing instead of monitors.
Monitors represent a significant step towards safer, structured concurrent programming by reducing the burden of correct lock management.
6.7.2 Implementing a Monitor Using Semaphores¶
Goal: Building Monitors from Lower-Level Primitives¶
We now show how a monitor (with signal-and-wait semantics) can be constructed using the synchronization tools we already have: semaphores. This demonstrates that monitors are a higher-level abstraction that can be implemented using lower-level mechanisms.
Core Components for Mutual Exclusion¶
For each monitor, we need:
- Binary Semaphore mutex (initialized to 1): Enforces the fundamental rule: "Only one process active in the monitor." A process must call wait(mutex) to enter and signal(mutex) to exit.
- Binary Semaphore next (initialized to 0): Used to coordinate between a signaling process and the awakened process. The signaling process suspends itself on next if it must wait for the awakened process to finish.
- Integer next_count: Counts the number of processes currently suspended on the next semaphore.
Implementing Monitor Procedures (Functions)¶
Every monitor procedure F is translated into a structure that manages the lock and the hand-off to processes woken up by signal().
Skeleton of Translated Procedure F:
wait(mutex); // ENTRY: Acquire the monitor lock.
// ... body of the original function F ...
// EXIT: Before leaving, check if there are processes
// that were signaled and are waiting on 'next'.
if (next_count > 0)
signal(next); // Let one of them proceed.
else
signal(mutex); // Otherwise, just free the monitor lock.
Explanation: The exit code ensures that if a process was awakened via a signal() and is waiting on next (because of signal-and-wait semantics), it gets to run before the signaling process fully releases the monitor. Otherwise, the monitor lock is simply released.
Implementing Condition Variables¶
For each condition variable x in the monitor, we need:
- Binary Semaphore x_sem (initialized to 0): The queue on which processes block when they execute x.wait().
- Integer x_count (initialized to 0): The number of processes currently waiting on condition x.
Implementing x.wait():
// Process executes this when it needs to wait for condition x.
x_count++; // 1. Increment the waiters count.
// 2. Release the monitor. Who gets it next?
if (next_count > 0)
signal(next); // Give it to a signaled process waiting on 'next'.
else
signal(mutex); // Otherwise, just release the monitor lock.
wait(x_sem); // 3. BLOCK: Suspend on the condition's queue.
x_count--; // 4. When resumed, we're about to re-enter. Decrement count.
Logic: The process adds itself to the waiters for x, then carefully releases the monitor lock, giving priority to any process already waiting on next. Finally, it blocks on x_sem. When later awakened, it proceeds, having already been granted re-entry to the monitor.
Implementing x.signal():
// Process executes this to wake up one waiter on condition x.
if (x_count > 0) { // 1. Are there any waiters?
next_count++; // 2. Yes. Increment count of processes waiting on 'next'.
signal(x_sem); // 3. Wake up ONE process waiting on condition x.
wait(next); // 4. SIGNAL-AND-WAIT: Suspend ourselves on 'next'.
next_count--; // 5. When we resume, we have the lock back. Clean up.
}
// If x_count == 0, signal() has no effect.
Logic: If there's a waiter, the signaling process:
- Notes that it will soon wait on next.
- Wakes up a waiter (on x_sem).
- Immediately suspends itself on next, handing control directly to the awakened process. This is the signal-and-wait behavior.
- When the awakened process eventually exits the monitor or waits again, it will signal(next), resuming the original signaling process.
Putting It All Together: The Flow of Control¶
Imagine Monitor M with condition x.
- Process P1 enters (wait(mutex)), runs, finds a condition false, and calls x.wait(). P1 blocks on x_sem. The monitor lock is released (via signal(mutex)).
- Process P2 now enters (wait(mutex)), runs, makes the condition true, and calls x.signal().
  - Since x_count > 0, P2 increments next_count, wakes P1 (signal(x_sem)), and immediately blocks itself on wait(next).
- Process P1, now awakened, proceeds from its wait(x_sem). It has the logical lock. P1 runs and eventually finishes its procedure.
  - On exit, P1 sees next_count > 0, so it calls signal(next), which wakes up P2.
- Process P2 resumes from wait(next) and completes its procedure.
  - On exit, P2 sees next_count is now 0, so it calls signal(mutex), finally releasing the monitor lock.
Example Monitor (See Figure 6.14)¶
Figure 6.14 shows a simple monitor for allocating a single resource.
- Shared Data: boolean busy.
- Condition: x (for processes to wait on if the resource is busy).
- Procedure acquire(): If busy is true, wait on x. Then set busy = true.
- Procedure release(): Set busy = false and signal x.
Using the semaphore implementation described above, this high-level, safe monitor code can be automatically translated into correct, low-level semaphore operations that manage all the locking and condition queueing.
Final Note: This implementation provides a general proof-of-concept. Real systems may optimize it (e.g., for signal-and-continue semantics, or using native OS support), but the core ideas of using a mutex semaphore and separate condition semaphores remain fundamental.
6.7.3 Resuming Processes within a Monitor¶
Scheduling Which Waiting Process Resumes¶
When multiple processes are suspended on the same condition variable x (via x.wait()) and x.signal() is called, a decision must be made: which waiting process gets to resume? The default, simplest policy is First-Come, First-Served (FCFS), where the longest-waiting process is resumed first.
However, FCFS is not always suitable. We often need priority-based scheduling (e.g., resuming the process with the shortest upcoming task first to improve system throughput). To support this, monitors can provide a conditional-wait construct.
The Conditional-Wait Construct¶
Syntax: x.wait(c);
- x is a condition variable.
- c is an integer expression (evaluated at the time of the wait() call) called the priority number. This number is stored alongside the suspended process's ID in the condition queue.
Resumption Rule: When x.signal() is executed, the process with the smallest priority number (i.e., highest priority, where lower number often means higher priority) is selected to resume next, not necessarily the one that waited longest.
Example: Priority-Based Resource Allocation¶
Consider the ResourceAllocator monitor from Figure 6.14, modified to use conditional wait. The goal: allocate a single resource to the process that requests it for the shortest duration, minimizing the time others must wait.
Modified acquire() procedure (conceptual):
void acquire(int time) {
if (busy)
x.wait(time); // 'time' is the priority number! Shorter time = higher priority.
busy = true;
}
- A process requesting the resource specifies time (its planned usage duration).
- If the resource is busy (busy == true), the process calls x.wait(time), using time as its priority number.
- The monitor's internal queue for condition x orders waiting processes by this time value.
- When the current holder calls release() and x.signal(), the process with the smallest time value (shortest requested duration) is awakened and granted the resource next.
This implements a shortest-job-first (SJF) scheduling policy for the resource.
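The queue policy behind conditional wait can be sketched as a simple selection over the waiters' priority numbers. The array representation and the function name pick_next_waiter are illustrative, not from the text.

```c
/* Given the priority numbers of the processes waiting on a condition,
   return the index of the one that x.signal() should resume next:
   the smallest priority number wins (SJF for the resource). */
int pick_next_waiter(const int prio[], int n) {
    int best = 0;
    for (int i = 1; i < n; i++)
        if (prio[i] < prio[best])
            best = i;
    return best;
}
```

With requested durations {9, 3, 12, 5}, the waiter that asked for 3 time units is resumed first.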
The Persistent Problem: Correct Usage is Not Enforced¶
Monitors automate mutual exclusion and condition signaling, but they cannot enforce correct protocol sequences at the application level. The programmer must still use the monitor correctly, and errors can break everything.
Required Protocol for the ResourceAllocator:
R.acquire(t); // 1. Request access with time t.
// 2. Access the resource...
R.release(); // 3. Release the resource.
Potential Programmer Errors (same as with semaphores):
- Access without acquiring: Skipping acquire(). ➜ Mutual exclusion violated.
- Fail to release: Skipping release(). ➜ Resource permanently locked; deadlock.
- Release without acquiring: Calling release() without ownership. ➜ Corrupts state (busy becomes false while no one holds the resource).
- Double acquire: Calling acquire() twice without an intervening release. ➜ Deadlocks itself.
These are logical errors in program flow, not synchronization errors. The monitor's automatic locking doesn't protect against them.
Why This is a Fundamental Limitation¶
- The Monitor is a Gateway, Not an Enforcer: The monitor controls access to the shared data (the busy variable) but not to the physical resource itself. A buggy or malicious process could bypass the monitor entirely and access the resource directly if the OS/hardware allows it.
- Correctness Depends on Client Code: To ensure the system works, we must:
  - Verify that all user processes call the monitor procedures in the correct sequence (acquire before release, etc.).
  - Ensure that no process ignores the monitor and touches the resource directly.
For small, well-controlled systems (like an OS kernel), this manual verification might be possible. For large, dynamic systems with arbitrary user code, it is impossible to guarantee.
The Path Forward: Stronger Mechanisms¶
This access-control problem—ensuring that resources are only accessed through specific, proper protocols—cannot be solved by synchronization tools alone. It requires system-level security and protection mechanisms.
Solution Preview (See Chapter 17):
- Hardware and OS support are needed to enforce mandatory access control.
- Mechanisms like capabilities or access control lists (ACLs) can be used to grant processes the right to call specific monitor functions only, not to access the underlying resource directly.
- The operating system kernel can act as a reference monitor, intercepting all resource access attempts and validating them against a security policy.
Conclusion: Monitors provide excellent structured synchronization but do not solve the broader resource protection problem. They move the correctness burden from low-level lock management to high-level protocol adherence, but final assurance requires comprehensive system security design.
6.8 Liveness¶
Introduction: The Problem of Indefinite Waiting¶
Synchronization tools are meant to solve the critical-section problem, but their use can introduce a new class of problems: processes may get stuck and fail to make progress. This violates the progress and bounded-waiting requirements.
Liveness is a set of system properties that guarantee processes make forward progress during their lifecycle. A liveness failure occurs when this guarantee is broken. Symptoms include poor performance, unresponsiveness, and hangs. Examples include infinite loops and, in concurrent programming, specific failures caused by synchronization.
We examine two major types of liveness failures: Deadlock and Starvation (covered next).
6.8.1 Deadlock¶
Definition of Deadlock¶
A deadlock occurs when a set of processes are each waiting for an event that can only be caused by another process in the same set. In the context of synchronization, the "event" is typically the release of a resource (a mutex lock, a semaphore, or a condition variable signal).
When deadlocked, no process in the set can ever make progress; they are permanently stuck unless external intervention occurs.
Classic Deadlock Illustration with Two Semaphores¶
Consider two processes, P0 and P1, and two binary semaphores, S and Q, both initialized to 1.
Process P0 executes:
wait(S); // Acquires S.
wait(Q); // Waits to acquire Q.
// ... Critical section using both S and Q ...
signal(S);
signal(Q);
Process P1 executes:
wait(Q); // Acquires Q.
wait(S); // Waits to acquire S.
// ... Critical section using both S and Q ...
signal(Q);
signal(S);
Possible Interleaving Leading to Deadlock:
- P0 executes wait(S) and acquires S.
- P1 executes wait(Q) and acquires Q.
- P0 executes wait(Q). Since Q is held by P1, P0 blocks.
- P1 executes wait(S). Since S is held by P0, P1 blocks.
Result: Deadlock.
- P0 is waiting for P1 to signal(Q).
- P1 is waiting for P0 to signal(S).
- Neither can ever proceed to execute its signal() operation. Both wait forever.
The Necessary Conditions for Deadlock¶
For a deadlock to be possible, four conditions must hold simultaneously (this is explored in depth in Chapter 8):
- Mutual Exclusion: Resources (like semaphores) are non-shareable (only one process can hold them at a time).
- Hold and Wait: A process holds at least one resource while waiting to acquire additional resources held by other processes.
- No Preemption: Resources cannot be forcibly taken away from a process; they must be released voluntarily.
- Circular Wait: A circular chain of processes exists in which each process is waiting for a resource held by the next process in the chain (e.g., P0 waits for P1, and P1 waits for P0).
In our example, the circular wait is clear: P0 -> Q (held by P1) -> P1 -> S (held by P0) -> P0.
Preventing the Example Deadlock¶
A common solution is to enforce a total ordering on all resource acquisitions. If every process acquires resources in the same, globally defined order, a circular wait cannot occur.
Fix: Define the order as S before Q. Rewrite P1 to follow the same order:
// P1 (Corrected)
wait(S); // Acquire S first, just like P0.
wait(Q); // Then acquire Q.
// ...
signal(Q);
signal(S);
Now, even if P0 acquires S first, P1 will block on wait(S) before it can acquire Q. This prevents the hold-and-wait scenario that leads to a circular wait, so deadlock is impossible.
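The corrected protocol can be exercised with pthread mutexes: both threads take the locks in the same global order (S before Q), so both always complete. A minimal sketch; the names task and run_ordering_demo are illustrative.

```c
#include <stddef.h>
#include <pthread.h>

static pthread_mutex_t S = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t Q = PTHREAD_MUTEX_INITIALIZER;
static int done = 0;

static void *task(void *arg) {
    (void)arg;
    pthread_mutex_lock(&S);   /* always S first... */
    pthread_mutex_lock(&Q);   /* ...then Q: no circular wait is possible */
    done++;                   /* critical section using both resources */
    pthread_mutex_unlock(&Q);
    pthread_mutex_unlock(&S);
    return NULL;
}

int run_ordering_demo(void) {
    pthread_t t0, t1;
    done = 0;
    pthread_create(&t0, NULL, task, NULL);
    pthread_create(&t1, NULL, task, NULL);
    pthread_join(t0, NULL);   /* both joins return: no deadlock */
    pthread_join(t1, NULL);
    return done;              /* both critical sections completed */
}
```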
Key Takeaway: Deadlock is a severe liveness failure where processes are permanently blocked. Careful design of locking protocols (like lock ordering) is necessary to prevent it. Chapter 8 will discuss deadlock in greater detail, including detection, prevention, and avoidance strategies.
6.8.2 Priority Inversion¶
The Problem: High-Priority Process Blocked by Low-Priority Process¶
Priority inversion is a specific liveness failure that occurs in priority-based scheduling systems (where a higher-priority process should always run before a lower-priority one). It breaks this fundamental expectation.
Scenario Setup:
- Three processes with priorities: L (Low), M (Medium), H (High). So: Priority(L) < Priority(M) < Priority(H).
- A shared resource (e.g., a kernel data structure) protected by a lock/semaphore S.
The Deadly Sequence:
- Low-priority process L acquires the lock S and begins using the shared resource.
- High-priority process H becomes ready and preempts L (as expected). H tries to acquire lock S but finds it held by L. H blocks, waiting for L to release S.
- At this point, L should run (to finish and release S so H can run). However...
- Medium-priority process M becomes ready. Since M has higher priority than L, it preempts L and starts running.
- Result: Process H (the highest priority) is now indirectly blocked by M (a medium-priority process), even though H and M do not share any resource. H must wait not only for L to finish, but for M (and any other medium-priority processes) to finish as well, because M prevents L from making progress. This inverts the expected priority order.
Impact: The high-priority process H may be delayed for an unbounded amount of time by lower-priority processes, violating real-time guarantees and causing severe responsiveness issues.
The Solution: Priority-Inheritance Protocol¶
The core fix is to temporarily boost the priority of the low-priority process holding a resource that a high-priority process needs. This is the priority-inheritance protocol.
How it works (applied to the example):
- Process L (Low) acquires lock S.
- Process H (High) tries to acquire S and blocks.
- Inheritance Triggered: The system detects that H is blocked waiting for a resource held by L. Therefore, L inherits the priority of H. L's priority is temporarily raised to match H's priority.
- Now, when process M (Medium) becomes ready, it cannot preempt L, because L now has a priority equal to H's (which is higher than M's).
- Process L continues running at high priority, finishes its work with resource S, and releases the lock.
- Priority Reversion: Upon releasing S, L loses its inherited priority and reverts to its original low priority.
- Correct Scheduling Resumes: Lock S is now free. The scheduler will now run the highest-priority ready process, which is H. Process H acquires S and runs. Only after H blocks or finishes would M get a chance to run.
Key Effect: The inheritance protocol prevents medium-priority processes from extending the blocking time of the high-priority process. The blocking time is now limited to the worst-case execution time of the critical section of the low-priority process (L), not an arbitrary chain of medium-priority processes.
Significance¶
Priority inversion is a critical concern in:
- Real-Time Operating Systems (RTOS) where missing deadlines can be catastrophic.
- Any system with priority scheduling and shared resources (including mainstream OS kernels).
The priority-inheritance protocol is a widely implemented solution (e.g., in the POSIX standard and many RTOSes). It ensures that priority inversion is bounded and does not lead to indefinite postponement of high-priority tasks.
6.9 Evaluation¶
Choosing the Right Synchronization Tool¶
We have covered multiple synchronization tools, from hardware instructions to high-level abstractions. All can ensure mutual exclusion when used correctly, but performance and suitability vary greatly. Choosing the right tool is crucial for building efficient concurrent systems. Here is a strategic guide.
Low-Level Foundation: Hardware Instructions¶
- Role: test_and_set(), compare_and_swap() (CAS), and memory barriers are low-level primitives. They are rarely used directly by application programmers but are the building blocks for all higher-level synchronization tools (mutexes, semaphores).
- Modern Trend - Lock-Free Algorithms: CAS is increasingly used to build lock-free (or non-blocking) data structures. These algorithms avoid traditional locks by using optimistic collision detection: attempt an update with CAS; if it fails (because another thread modified the data), retry.
- Philosophy: CAS is optimistic. Assume no conflict, then detect and recover.
- Vs. Traditional Locks: Mutex locks are pessimistic. Assume conflict, so acquire exclusive access before proceeding.
- Challenge: Lock-free algorithms are notoriously difficult to design and prove correct.
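The optimistic retry pattern can be sketched as follows. CPython exposes no user-level CAS instruction, so the SimulatedAtomic class below (an illustrative name, not a real library type) fakes one with a lock purely to make the retry-loop structure visible; in C++ (std::atomic compare_exchange) or Java (AtomicInteger.compareAndSet) the same loop would map onto the hardware instruction:

```python
import threading

class SimulatedAtomic:
    """Simulates an atomic cell; the internal lock stands in for hardware CAS."""
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        # Atomically: if the value still equals `expected`, store `new`.
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

counter = SimulatedAtomic(0)

def increment(n):
    for _ in range(n):
        while True:                      # optimistic retry loop
            old = counter.load()         # read the current value
            if counter.compare_and_swap(old, old + 1):
                break                    # CAS succeeded: update applied
            # CAS failed: another thread changed the value; retry

threads = [threading.Thread(target=increment, args=(1000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter.load())  # 4000: no lost updates despite contention
```

Note how the loop never blocks: a failed CAS simply means another thread won the race, and the loser re-reads and retries.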
Performance Comparison: CAS vs. Traditional Locks¶
The choice between CAS-based approaches and traditional locking (mutex/semaphore) depends heavily on contention—how many threads compete for the resource simultaneously.
Uncontended (No Competition):
- Both are very fast. CAS is somewhat faster because it involves a single atomic hardware instruction, while a mutex lock requires a few more instructions for the lock acquisition path.
Moderate Contention:
- CAS is generally faster—often significantly so. Why?
- In the CAS retry loop, failures are usually resolved after a few spins. The thread never leaves the CPU (no context switch).
- With a contended mutex, a failing acquire() leads to the thread being suspended, placed on a wait queue, and a context switch—a heavy-weight operation involving kernel intervention. The cost of two context switches often outweighs a few CAS retries.
High Contention:
- Traditional locking (with blocking) becomes preferable. Under extreme contention, the CAS approach suffers from cache line bouncing (many CPUs constantly reading/writing the same memory location, invalidating each other's caches) and starvation risk (some threads may retry indefinitely). The overhead of many threads spinning wastes CPU cycles.
- A blocking mutex, while expensive per thread, quiets the system by putting most threads to sleep, reducing memory contention and letting the lock holder proceed efficiently.
Guidelines for Selecting Synchronization Mechanisms¶
For Simple Counters/Updates: Use atomic variables (provided by the language/OS, e.g., std::atomic in C++, AtomicInteger in Java). They are lightweight wrappers around CAS and are far more efficient than a mutex for a single variable update.

For Critical Sections on Multiprocessors:
- Short-duration hold: Use a spinlock. The rule of thumb: if the hold time is less than the cost of two context switches, spinning is more efficient. Common in OS kernels.
- Longer or unpredictable hold: Use a blocking mutex (which may internally use a spinlock for its own short critical section, as in semaphore implementation).
Mutual Exclusion vs. Resource Management:
- For simple mutual exclusion, a mutex lock is simpler and has less overhead than a binary semaphore.
- For managing a pool of identical resources (e.g., 5 tape drives), a counting semaphore is the natural fit.
- For scenarios with many readers and few writers, a reader-writer lock can provide better concurrency than a plain mutex.
High-Level Abstractions (Monitors, Condition Variables):
- Primary Appeal: Simplicity and safety. They reduce programmer error by automating lock management.
- Cost: They may have higher overhead and less scalability under high contention due to their inherent structure. However, they are often the best choice for application-level correctness and maintainability.
Real-World Example: Priority Inversion on Mars Pathfinder¶
Highlighted Box: Priority Inversion and the Mars Pathfinder

This case study underscores that synchronization failures are not just academic. In 1997, the Mars Pathfinder lander experienced repeated system resets due to undetected priority inversion in its VxWorks real-time OS.
- A high-priority task (bc_dist) was blocked waiting for a shared resource held by a low-priority task (ASI/MET).
- Medium-priority tasks then preempted the low-priority task, preventing it from releasing the resource, causing the high-priority task to miss its deadline.
- The watchdog timer (bc_sched) would then reset the entire system.
- Solution: Engineers on Earth enabled the priority-inheritance protocol (a global variable in VxWorks) via remote command. The low-priority task inherited the high priority while holding the resource, preventing preemption, and the problem was solved. This saved the mission.
Ongoing Research and Future Directions¶
The quest for better synchronization tools continues, focusing on:
- More efficient compilers for concurrent code.
- Programming languages with built-in, safe concurrency models (e.g., Rust's ownership, Go's channels, Erlang's actors).
- Performance improvements in existing synchronization libraries and APIs.
In the next chapter, we will see how real operating systems (Linux, Windows, macOS) and APIs (Pthreads) implement these tools in practice.
6.10 Summary¶
Core Concepts¶
- Race Condition: A situation where multiple processes concurrently access and manipulate shared data, and the final outcome depends on the non-deterministic order of these accesses. This leads to data corruption.
- Critical Section: A segment of code where a process accesses and updates shared data. It's the region where a race condition can occur.
- Critical-Section Problem: The challenge of designing a protocol to ensure that when one process is in its critical section, no other process is allowed to be in theirs, thereby enabling safe data sharing.
Requirements for a Correct Solution¶
Any valid solution must satisfy:
- Mutual Exclusion: Only one process can be active in its critical section at a time.
- Progress: If no process is in its critical section and some processes want to enter, the selection of the next process must not be postponed indefinitely and must involve only those willing to enter.
- Bounded Waiting: After a process requests entry to its critical section, there must be a limit on how many times other processes can enter before it does (prevents starvation).
Synchronization Tools: A Hierarchy¶
1. Software Solutions (Conceptual, Not Practical):
- Peterson's Algorithm: A classic two-process solution that logically satisfies all three requirements but fails on modern hardware due to instruction reordering.
2. Hardware Support (The Foundation):
- Memory Barriers (Fences): Instructions that prevent reordering of memory operations across the barrier, ensuring visibility to other processors.
- Atomic Hardware Instructions (e.g., Compare-and-Swap - CAS): Low-level instructions that perform a read-modify-write sequence indivisibly. Used to build all higher-level tools.
- Atomic Variables: Lightweight, language-level abstractions built on CAS for safe updates to single variables (like counters).
3. Low-Level Synchronization Primitives:
- Mutex Locks: Provide mutual exclusion. A process must acquire the lock before entering a critical section and release it after leaving.
- Spinlock: A mutex implemented with busy-waiting. Efficient for very short holds on multiprocessors.
- Semaphores: A more general integer-based synchronization tool.
- Binary Semaphore: Value 0 or 1; functions like a mutex lock.
- Counting Semaphore: Can hold any non-negative integer; manages access to a pool of identical resources.
4. High-Level Abstractions:
- Monitors: An Abstract Data Type (ADT) that encapsulates shared data and the procedures that operate on it. The language runtime automatically provides mutual exclusion for all monitor procedures, making it safer and easier to use.
- Condition Variables: Used within monitors (or with explicit locks) to allow processes to wait for a specific condition (wait()) and to signal others when the condition may be true (signal()).
Liveness Problems¶
- Liveness: The guarantee that a system will make progress.
- Liveness Failures: Situations where processes cannot make progress.
- Deadlock: Two or more processes are permanently blocked, each waiting for an event that can only be caused by another waiting process. (e.g., circular wait for locks).
- Priority Inversion: A scheduling problem in priority-based systems where a high-priority process is forced to wait indefinitely for a low-priority process, which in turn is preempted by medium-priority processes. Solved by the priority-inheritance protocol.
Performance Evaluation and Selection Guidelines¶
The best synchronization tool depends on the contention level:
- Uncontended/Low Contention: Atomic variables and CAS-based approaches are very fast.
- Moderate Contention: CAS-based approaches often outperform traditional blocking locks, as they avoid costly context switches.
- High Contention: Traditional blocking locks (mutexes, semaphores) become preferable to reduce cache thrashing and CPU waste from excessive spinning.
- General Rules:
- Use the simplest, highest-level tool that works (e.g., a monitor over hand-crafted semaphores) for safety.
- Use a mutex for simple mutual exclusion.
- Use a counting semaphore for managing multiple resource instances.
- Use spinlocks only for very short critical sections on multiprocessors.
Chapter 7: Synchronization Examples¶
Chapter Introduction & Objectives¶
This chapter applies the synchronization tools you learned in Chapter 6 (like mutex locks, semaphores, and monitors) to solve real, classic problems in operating systems.
Remember from Chapter 6: The critical-section problem is about preventing race conditions when multiple processes/threads access shared data. You studied solutions ranging from low-level hardware instructions (like compare-and-swap) to high-level software tools (mutex, semaphore, monitor). You also learned about liveness hazards like deadlock.
Chapter 7 Objectives:
- Explain three classic synchronization problems: Bounded-Buffer, Readers-Writers, and Dining-Philosophers.
- Describe how real OSes (Linux & Windows) handle synchronization.
- Show how to use POSIX and Java APIs for synchronization.
- Enable you to design your own solutions using these APIs.
7.1 Classic Problems of Synchronization¶
This section presents famous problems that act as a standard test for any new synchronization technique. The solutions here use semaphores (the traditional teaching method), but in practice, you could often use mutex locks instead of binary semaphores.
7.1.1 The Bounded-Buffer Problem¶
What is this problem? It's the classic Producer-Consumer problem, first introduced in Section 6.1. A producer process creates data items, and a consumer process uses them. They share a fixed-size buffer (a queue of n slots).
Why is it important? It's a perfect model for any scenario where data is passed from one executing entity to another with limited temporary storage (e.g., a print spooler, data pipes, or network packet buffers).
Shared Data Structures¶
The producer and consumer share these synchronization variables:
int n; // Total number of buffers in the pool
semaphore mutex = 1; // Binary semaphore (acts like a mutex lock). Ensures mutual exclusion for accessing the buffer pool itself. Initialized to 1 (unlocked).
semaphore empty = n; // Counting semaphore. Tracks the number of EMPTY buffers. Initially, all `n` buffers are empty.
semaphore full = 0; // Counting semaphore. Tracks the number of FULL buffers. Initially, there are zero full buffers.
- The Buffer Pool: Think of it as an array of n slots. The mutex protects the actual insert/remove operations on this array.
- The Role of empty and full: These semaphores handle the synchronization condition. The producer must wait if there are no empty slots (empty == 0). The consumer must wait if there are no full slots (full == 0).
The Producer Process (Figure 7.1)¶
while (true) {
... // Produce an item (this takes time and is done OUTSIDE the critical section)
wait(empty); // DECREMENT the 'empty' count. WAIT here if empty == 0 (no free buffers).
wait(mutex); // ENTER critical section. Acquire lock on the buffer pool.
... // Add the produced item to the buffer (this is the actual critical section)
signal(mutex); // EXIT critical section. Release the buffer pool lock.
signal(full); // INCREMENT the 'full' count. Signals consumer that a buffer is now ready.
}
Step-by-step logic for the Producer:
- Produce Item: Creates data. This is outside the critical section for efficiency.
- wait(empty);: This is crucial. The producer first checks if there's space. If empty > 0, it decrements empty and proceeds. If empty == 0, the buffer is full, and the producer blocks here, waiting for a consumer to free a slot.
- wait(mutex);: Now it safely acquires the lock to manipulate the shared buffer (the actual array/index).
- Add to Buffer: Performs the actual insert operation.
- signal(mutex);: Releases the lock on the buffer.
- signal(full);: Increments the full semaphore. This signals any waiting consumer that there is now at least one item to consume.
The Consumer Process (Figure 7.2)¶
(The structure is symmetrical but opposite.)
while (true) {
wait(full); // DECREMENT the 'full' count. WAIT here if full == 0 (nothing to consume).
wait(mutex); // ENTER critical section. Acquire lock on the buffer pool.
... // Remove an item from the buffer (critical section)
signal(mutex); // EXIT critical section.
signal(empty); // INCREMENT the 'empty' count. Signals producer that a buffer is now free.
... // Consume the item (outside critical section)
}
Step-by-step logic for the Consumer:
- wait(full);: Checks if there's anything to consume. If full > 0, it decrements full and proceeds. If full == 0, the buffer is empty, and the consumer blocks.
- wait(mutex);: Acquires the lock to manipulate the buffer.
- Remove from Buffer: Takes an item from the buffer.
- signal(mutex);: Releases the lock.
- signal(empty);: Increments the empty semaphore. This signals any waiting producer that there is now at least one free slot.
- Consume Item: Uses the data. Done outside the critical section.
Why This Solution Works (Key Insights)¶
- Symmetry: The producer "produces" full buffers for the consumer. The consumer "produces" empty buffers for the producer.
- Order of wait() Calls is SAFE: Notice both processes acquire the condition semaphore (empty or full) before the mutual exclusion semaphore (mutex). This prevents deadlock. Imagine if they took mutex first: a producer could lock the buffer and then find it full, but it would be holding the lock while waiting, preventing any consumer from running to empty a buffer. This is a classic deadlock scenario. The presented order avoids it.
- Separation of Concerns: mutex handles mutual exclusion (short, fast operations). empty/full handle synchronization (waiting for a condition). This is a clean and efficient pattern.
In Modern Practice: The mutex semaphore (binary) would typically be implemented as a simpler mutex lock. The counting semaphores empty and full express the waiting conditions; in a monitor-based design they would be replaced by condition variables.
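The full producer-consumer pattern maps almost line-for-line onto Python's threading primitives. This is an illustrative sketch, not the textbook's code: Semaphore stands in for the counting semaphores empty and full, a Lock stands in for the binary mutex, and the buffer size and item count are arbitrary choices:

```python
import threading
from collections import deque

N = 5                                   # number of buffer slots
buffer = deque()
mutex = threading.Lock()                # binary "mutex" semaphore
empty = threading.Semaphore(N)          # counts EMPTY slots, starts at n
full = threading.Semaphore(0)           # counts FULL slots, starts at 0
consumed = []

def producer(items):
    for item in items:
        empty.acquire()                 # wait(empty): block if no free slot
        with mutex:                     # wait(mutex) ... signal(mutex)
            buffer.append(item)         # critical section: add the item
        full.release()                  # signal(full): one more item ready

def consumer(count):
    for _ in range(count):
        full.acquire()                  # wait(full): block if nothing to take
        with mutex:
            item = buffer.popleft()     # critical section: remove an item
        empty.release()                 # signal(empty): one more free slot
        consumed.append(item)           # consume OUTSIDE the critical section

p = threading.Thread(target=producer, args=(range(20),))
c = threading.Thread(target=consumer, args=(20,))
p.start(); c.start()
p.join(); c.join()
print(consumed == list(range(20)))  # True: all items delivered, in order
```

Because the condition semaphores are acquired before mutex, the producer can never sleep on a full buffer while holding the lock, exactly as the analysis above requires.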
7.1.2 The Readers-Writers Problem¶
Problem Definition and Scenario¶
Imagine a shared database accessed by multiple concurrent processes. Processes are of two types:
- Readers: Processes that only read the database. They examine data but do not modify it.
- Writers: Processes that both read and update (write to) the database. They change the data.
The Core Conflict:
- Multiple Readers: If two or more readers access the database simultaneously, it's perfectly safe. No data is being changed, so they will all see a consistent state.
- Writer with Any Other Process: If a writer accesses the database simultaneously with any other process (another writer or a reader), serious problems (chaos) can occur.
- A reader might see partially updated, inconsistent data.
- Two writers might overwrite each other's changes.
The Synchronization Requirement: Writers must have exclusive access to the shared database. When a writer is active, no other readers or writers can be active. This is the essence of the readers-writers problem.
Variations and Priority Policies¶
The problem has tricky variations based on priority, which lead to different behaviors and potential starvation.
The First Readers-Writers Problem:
- Rule: No reader should be kept waiting unless a writer has already obtained permission to use the shared object.
- Priority: Readers have priority.
- How it works: If a reader arrives and a writer is merely waiting (but hasn't started), the reader can jump ahead and start reading. Furthermore, if one reader is active, all other arriving readers can join in immediately.
- Consequence: A steady stream of readers can cause a waiting writer to starve (wait forever). This is known as writer starvation.
The Second Readers-Writers Problem:
- Rule: Once a writer is ready, that writer performs its write as soon as possible.
- Priority: Writers have priority.
- How it works: If a writer is waiting, no new readers are allowed to start reading. The system waits for the current readers to finish, then lets the writer go. New readers must wait until no writers are waiting.
- Consequence: A steady stream of writers can cause waiting readers to starve. This is reader starvation.
The textbook notes that both of these simple solutions can lead to starvation. More complex, starvation-free solutions exist (and are referenced in the bibliography). The solution presented next is for the First Readers-Writers Problem (reader-priority).
Structure of the Consumer Process (Go to Figure 7.2)¶
(This figure is from the previous Bounded-Buffer section and is shown again here for context. Let's recap it clearly.)
The Consumer Process Code:
while (true) {
wait(full); // Step 1: Wait until there's at least one full buffer (blocks if full == 0).
wait(mutex); // Step 2: Acquire lock to access the buffer pool.
... // Critical Section: Remove an item from the buffer.
signal(mutex); // Step 3: Release the buffer pool lock.
signal(empty); // Step 4: Signal that an empty buffer slot has been created.
... // Consume the item (outside the critical section, so it doesn't block others).
}
Important Symmetry: Compare this with the Producer in Figure 7.1. The Consumer is its mirror image: it waits on full, signals empty, and its internal critical section is a remove operation instead of an add.
Key Takeaway for Readers-Writers¶
The Readers-Writers problem is more complex than the Bounded-Buffer because it involves two classes of processes with different synchronization rules:
- Mutual Exclusion for Writers: Must be absolute (like the mutex in the buffer problem).
- Concurrent Access for Readers: Must be allowed, but only in the absence of a writer.
The upcoming solution will need shared variables to track the number of active readers so the first reader locks out writers and the last reader lets them in.
Solution to the First Readers-Writers Problem (Reader-Priority)¶
Shared Data Structures¶
The solution uses the following shared variables:
semaphore rw_mutex = 1; // Binary semaphore. Main lock for writers AND the first/last reader.
semaphore mutex = 1; // Binary semaphore. Protects the shared integer 'read_count' from race conditions.
int read_count = 0; // Integer counter. Tracks the number of currently active readers.
Purpose of Each Variable:
- rw_mutex (the "database lock"): This is the fundamental semaphore that ensures writers have exclusive access. A writer must hold this to enter its critical section. It also serves a special purpose for readers.
- mutex (the "counter lock"): This semaphore provides mutual exclusion for updating read_count. Since multiple readers may try to increment/decrement read_count simultaneously, we must protect it to avoid a race condition on the counter itself.
- read_count: The key to allowing concurrent reading. It answers the question: "Are we switching between a state with zero readers and a state with some readers?"
The Writer Process (Go to Figure 7.3)¶
The writer's logic is straightforward, similar to a standard mutual exclusion critical section.
while (true) {
wait(rw_mutex); // Acquire exclusive lock on the database. Blocks if held by anyone (writer or first reader).
... // Critical Section: Writing is performed here. Exclusive access guaranteed.
signal(rw_mutex); // Release the lock on the database.
}
- A writer only cares about rw_mutex. If rw_mutex is available (value 1), the writer grabs it and proceeds, locking out all other writers and any new readers. If rw_mutex is held (by another writer or by the first active reader), the writer blocks on this semaphore.
The Reader Process (Go to Figure 7.4)¶
The reader's logic is more complex because it must coordinate with other readers.
while (true) {
// ENTRY SECTION
wait(mutex); // Step 1: Lock the read_count variable.
read_count++; // Step 2: Announce I'm becoming an active reader.
if (read_count == 1) { // Step 3: AM I THE *FIRST* READER?
wait(rw_mutex); // Step 3a: If yes, I must lock out writers by acquiring rw_mutex.
}
signal(mutex); // Step 4: Release the lock on read_count. Other readers can now update it.
... // Critical Section: Reading is performed here. Multiple readers can be here concurrently.
// EXIT SECTION
wait(mutex); // Step 5: Lock the read_count variable again.
read_count--; // Step 6: Announce I'm leaving.
if (read_count == 0) { // Step 7: AM I THE *LAST* READER LEAVING?
signal(rw_mutex); // Step 7a: If yes, release the rw_mutex so a waiting writer can proceed.
}
signal(mutex); // Step 8: Release the lock on read_count.
}
How This Solution Works (Step-by-Step Logic)¶
Scenario 1: A Reader Arrives When No One is Active
- Reader acquires mutex (easy).
- read_count becomes 1.
- Since read_count == 1, it executes wait(rw_mutex). This succeeds, locking rw_mutex.
- Releases mutex.
- Now: The reader is reading. rw_mutex is held (by this first reader), so any arriving writer will block on wait(rw_mutex).
Scenario 2: Another Reader Arrives While One is Reading
- Second reader acquires mutex.
- Increments read_count to 2.
- read_count != 1, so it does NOT call wait(rw_mutex).
- Releases mutex.
- Now: Both readers are reading concurrently. The second reader did not need to wait on rw_mutex because the first reader already holds it on behalf of the whole group, keeping writers out.
Scenario 3: A Writer Arrives While Readers are Active
- Writer calls wait(rw_mutex). Since rw_mutex is held by the first reader, the writer blocks here.
- This is the key to reader priority: new readers can still jump in! A reader arriving now will increment read_count past 1, skip the wait(rw_mutex), and start reading immediately, even with a writer waiting.
Scenario 4: The Last Reader Leaves
- When the final reader executes its exit section, read_count goes from 1 to 0.
- Because read_count == 0, it calls signal(rw_mutex).
- This wakes up a single process waiting on rw_mutex. That could be a writer or, if no writer is waiting, the "first" reader of a new group stuck at wait(rw_mutex) (which only happens if a writer was waiting in between reader groups). The scheduler decides which one proceeds.
Important Observations from the Text¶
- Queuing Behavior: If a writer is in its critical section and n readers are waiting:
  - 1 reader is queued on rw_mutex (the "first" reader of the group, blocked because the writer holds it).
  - n-1 readers are queued on mutex (they are stuck trying to increment read_count, blocked behind the first reader, who holds mutex while it waits on rw_mutex). (Go to Figure 7.4 to trace this: a reader holds mutex when it checks if (read_count == 1) and calls wait(rw_mutex). It does not release mutex until after that possible wait.)
- Scheduler Decision: The signal(rw_mutex) call doesn't specify who wakes up. If both readers and writers are waiting, the OS scheduler chooses. This inherent non-determinism can contribute to starvation in this simple solution.
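The entry and exit sections from Figure 7.4 translate directly into Python. This is an illustrative sketch (the shared list, the results snapshots, and the thread counts are demonstration additions, not part of the protocol):

```python
import threading

rw_mutex = threading.Semaphore(1)    # writers' (and first reader's) lock
mutex = threading.Lock()             # protects read_count
read_count = 0
shared = []

def reader(results):
    global read_count
    # ENTRY SECTION
    with mutex:
        read_count += 1
        if read_count == 1:
            rw_mutex.acquire()       # first reader locks out writers
    results.append(list(shared))     # read (concurrent with other readers)
    # EXIT SECTION
    with mutex:
        read_count -= 1
        if read_count == 0:
            rw_mutex.release()       # last reader lets writers in

def writer():
    rw_mutex.acquire()               # wait(rw_mutex): exclusive access
    shared.append("update")          # write
    rw_mutex.release()               # signal(rw_mutex)

results = []
threads = [threading.Thread(target=reader, args=(results,)) for _ in range(3)]
threads.append(threading.Thread(target=writer))
for t in threads: t.start()
for t in threads: t.join()
print(read_count, shared)
```

Each reader's snapshot is either [] or ["update"], never a torn intermediate state, because the writer can only run while no reader group holds rw_mutex.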
Generalization: Reader-Writer Locks¶
The textbook notes that this problem is so common that many systems provide a reader-writer lock as a direct synchronization primitive.
How it works: A process requests the lock in a specific mode: read mode or write mode.
Rules:
- Multiple processes can hold a read-mode lock concurrently.
- Only one process can hold a write-mode lock, and it must have exclusive access (no other readers or writers).
When to Use Reader-Writer Locks:
- When you can clearly identify which threads are readers-only and which are writers.
- When your application has many more reads than writes. The performance overhead of managing the reader-writer lock is worth it because the increased concurrency (multiple readers) provides a significant speedup over a simple mutex that would allow only one reader at a time.
7.1.3 The Dining-Philosophers Problem¶
Problem Scenario (Go to Figure 7.5)¶
Imagine five philosophers sitting at a circular table. Each philosopher's life alternates between thinking and eating.
- In front of each philosopher is a plate.
- In the center of the table is a bowl of rice (the shared data/resource).
- Between each pair of adjacent philosophers lies a single chopstick. This means there are five chopsticks total.
- Rule: To eat, a philosopher needs two chopsticks—the one to their left and the one to their right. They can only pick up one chopstick at a time, and cannot take a chopstick from a neighbor's hand.
Why This Problem is Important¶
The textbook clarifies: This is a classic problem not because philosophers are important, but because it perfectly models a large class of real-world concurrency problems. It represents the challenge of allocating multiple resources among multiple processes while avoiding both deadlock and starvation.
Mapping to Computing:
- Philosopher = Process/Thread
- Chopstick = Shared Resource (like a tape drive, I/O port, or a lock)
- Eating = Executing in a critical section that requires two resources.
7.1.3.1 Semaphore Solution¶
The Flawed Implementation¶
The textbook presents a direct but flawed semaphore-based implementation.
Shared Data:
semaphore chopstick[5]; // Each semaphore is initialized to 1.
Each chopstick is represented by a binary semaphore. wait() acquires it; signal() releases it.
Structure of Philosopher i (Recap from Figure 7.6):
while (true) {
wait(chopstick[i]); // 1. Pick up left chopstick
wait(chopstick[(i + 1) % 5]); // 2. Pick up right chopstick
... // Eat (Critical Section)
signal(chopstick[i]); // 3. Put down left
signal(chopstick[(i + 1) % 5]); // 4. Put down right
... // Think
}
Analysis of the Flawed Solution¶
- What it DOES guarantee: Mutual exclusion on each chopstick. No two neighboring philosophers can be eating at the same time because they share a chopstick.
- What it DOES NOT prevent: DEADLOCK.
- The Deadlock Scenario (Repeated for Clarity): If all five philosophers execute wait(chopstick[i]) simultaneously before any can get to their second wait():
  - Philosopher 0 holds chopstick 0.
  - Philosopher 1 holds chopstick 1.
  - Philosopher 2 holds chopstick 2.
  - Philosopher 3 holds chopstick 3.
  - Philosopher 4 holds chopstick 4.
- Now: the chopstick[] semaphore values are all 0 (held). Each philosopher is blocked forever on wait(chopstick[(i+1) % 5]) because that chopstick is held by their neighbor. Circular wait is established. The system halts.
The Core Challenge¶
The Dining-Philosophers problem forces us to examine the four conditions under which deadlock arises here and to design a protocol (synchronization scheme) that breaks at least one of them:
- Mutual Exclusion: A chopstick can be held by only one philosopher at a time.
- Hold and Wait: A philosopher holds one chopstick while waiting for another. (We may need to break this condition to prevent deadlock).
- No Preemption: Chopsticks cannot be forcibly taken from a philosopher. (This condition is usually kept).
- Circular Wait: The scenario described above. (This is the condition we must explicitly break).
We also need to guard against starvation (a philosopher might never get to eat even though others do).
Possible Remedies to Prevent Deadlock¶
The textbook suggests several strategies. All aim to break at least one of the four necessary conditions for deadlock (Mutual Exclusion, Hold and Wait, No Preemption, Circular Wait).
Remedy 1: Limit Concurrent Philosophers¶
Allow at most four philosophers to be sitting simultaneously at the table.
- How: Use an additional counting semaphore table initialized to 4.
- Implementation: A philosopher must call wait(table) before attempting to pick up any chopsticks, and signal(table) after putting them down.
- Why it Works (Breaks Circular Wait): With only 4 actors for 5 resources (chopsticks), the pigeonhole principle guarantees at least one philosopher will be able to acquire two chopsticks. There will always be at least one free chopstick between two seated philosophers, breaking the inevitable circle of five holds-and-waits.
Remedy 2: Acquire Both Chopsticks Atomically¶
Allow a philosopher to pick up her chopsticks only if both are available.
- How: Use an additional binary semaphore or mutex lock (mutex) to make the check-and-acquire of both chopsticks a single, atomic (indivisible) operation.
- Implementation: The philosopher first calls wait(mutex), then checks whether both her left and right chopsticks are free. If yes, she picks up both (setting their semaphores to 0), then calls signal(mutex). If not, she releases mutex and retries or waits.
- Why it Works (Breaks Hold and Wait): A philosopher never holds a resource while waiting. She is either idle or holds both resources. This eliminates the "hold one, wait for another" scenario.
Remedy 3: Asymmetric Acquisition Order¶
Use an asymmetric solution.
- Rule: Odd-numbered philosophers pick up their left chopstick first, then their right. Even-numbered philosophers pick up their right chopstick first, then their left.
- Why it Works (Breaks Circular Wait): This disrupts the uniform, clockwise circular waiting pattern. Philosopher 0 (even) grabs chopstick 1 (right), while Philosopher 1 (odd) grabs chopstick 1 (left). They now contend for the same first resource (chopstick 1). One will get it, the other will block, preventing the formation of a complete, unbroken circle of holds-and-waits around the table.
Crucial Distinction: Deadlock vs. Starvation¶
The textbook makes a vitally important final point in this section:
"any satisfactory solution ... must guard against the possibility ... of starvation. A deadlock-free solution does not necessarily eliminate the possibility of starvation."
- Deadlock: A global system standstill where no philosopher can make progress. It's usually caused by a structural flaw in the protocol (like the circular wait).
- Starvation (Livelock/Indefinite Postponement): A local, individual problem where one or more philosophers may be indefinitely denied the chance to eat, even though others are eating and the system as a whole is making progress. This can happen due to scheduling policies or timing.
- Example: In the asymmetric solution, if philosophers 1 and 2 are very fast, they might always grab chopsticks just before philosopher 0 can, causing philosopher 0 to repeatedly miss out.
Conclusion: A correct solution must be both deadlock-free and strive to be starvation-free. The remedies above address deadlock; preventing starvation often requires additional fairness mechanisms (like FIFO queues on semaphores).
7.1.3.2 Monitor Solution¶
Introduction to the Monitor Approach¶
This section presents a deadlock-free solution using a monitor, a high-level synchronization construct you learned in Chapter 6. Recall that a monitor is like a class where all data is private, and only one thread can be active inside the monitor's methods at a time, providing automatic mutual exclusion.
The core idea of this solution is Remedy #2 from the previous section: a philosopher picks up chopsticks only if both are available, and this check is performed atomically inside the monitor.
Monitor Data Structures (Go to Figure 7.7)¶
The DiningPhilosophers monitor encapsulates the following shared data:
- state[5]: An array tracking each philosopher's status. Each element can be one of three constants:
  - THINKING: Philosopher is not trying to eat.
  - HUNGRY: Philosopher wants to eat (has called pickup()), but doesn't have both chopsticks yet.
  - EATING: Philosopher has both chopsticks and is eating.
- self[5]: An array of condition variables. self[i] is used to suspend philosopher i if she is hungry but cannot eat immediately (because a neighbor is eating). This is where a philosopher waits.
The Monitor Operations Explained¶
Key Helper Function: test(int i)
This function checks if philosopher i can start eating.
void test(int i) {
if ( (state[(i + 4) % 5] != EATING) && // Left neighbor is NOT eating
(state[i] == HUNGRY) && // I am HUNGRY
(state[(i + 1) % 5] != EATING) ) { // Right neighbor is NOT eating
state[i] = EATING; // Change my state to EATING
self[i].signal(); // Wake me up if I was waiting
}
}
- Atomic Check: Because this runs inside the monitor, the check of both neighbors' states is atomic—no interleaving can occur.
- Why (i + 4) % 5? This is the index of the left neighbor (in a circle of 5, moving one step counter-clockwise).
- The signal(): It only has an effect if philosopher i is actually waiting on self[i]. If she's not waiting (the transition from HUNGRY to EATING happened immediately in pickup()), the signal is ignored.
Main Operation 1: pickup(int i)
This is called by a philosopher when she gets hungry.
void pickup(int i) {
state[i] = HUNGRY; // Step 1: Declare intent to eat.
test(i); // Step 2: Check if I can eat right now.
if (state[i] != EATING) { // Step 3: Did the test() succeed?
self[i].wait(); // Step 3a: If not, wait on my condition variable.
}
}
Logic Flow:
- Philosopher i sets her state to HUNGRY.
- She calls test(i). If both neighbors are not eating, test(i) will:
  - Set state[i] = EATING.
  - Call self[i].signal() (which does nothing now, as she's not waiting).
- The philosopher then checks her own state. If test(i) succeeded, state[i] is already EATING, so she skips the wait and returns from pickup() immediately to start eating.
- If test(i) failed (a neighbor is eating), her state remains HUNGRY. She then executes self[i].wait(). This releases the monitor lock and suspends her thread, queuing it on condition variable self[i].
Main Operation 2: putdown(int i)
This is called by a philosopher when she finishes eating.
void putdown(int i) {
state[i] = THINKING; // Step 1: I'm done eating.
test((i + 4) % 5); // Step 2: See if my LEFT neighbor can now eat.
test((i + 1) % 5); // Step 3: See if my RIGHT neighbor can now eat.
}
Logic Flow:
- Philosopher i sets her state back to THINKING.
- She tests her two neighbors (left, then right). This is crucial: when a philosopher puts down her chopsticks, she may be enabling one or both of her hungry neighbors to eat.
- The test() call for a neighbor checks that neighbor's conditions. If the neighbor is HUNGRY and her other neighbor isn't eating, test() changes her state to EATING and signals her condition variable (self[neighbor].signal()), waking her up.
- This cascading wake-up mechanism ensures progress.
How a Philosopher Uses the Monitor¶
Philosopher i follows this exact protocol, which is enforced by programming discipline:
DiningPhilosophers.pickup(i); // Request permission to eat (may block here)
... // EAT (Critical Section)
DiningPhilosophers.putdown(i); // Notify monitor she is done
... // THINK
Why This Solution is Correct¶
- Mutual Exclusion (No Adjacent Eaters): A philosopher enters the EATING state only if both neighbors are not EATING. This check is atomic within the monitor. Therefore, two neighbors can never be simultaneously in the EATING state.
- Deadlock-Free: Deadlock requires a circular wait. Here, a philosopher never holds a resource while waiting. She is either THINKING (holds nothing), EATING (holds both), or HUNGRY and suspended on a condition variable (holds nothing; the monitor lock is released during wait()). This breaks the "Hold and Wait" condition.
- No Race Conditions: All state changes happen inside the monitor, guaranteeing mutual exclusion for the shared state[] array.
The Remaining Problem: Starvation¶
The textbook explicitly states: "it is possible for a philosopher to starve to death."
- Why? While the solution prevents deadlock, it doesn't guarantee fairness. Imagine a scenario where philosophers 0 and 2 eat in rapid alternation. Every time philosopher 1 becomes hungry and calls pickup(1), she might find that either her left neighbor (0) or her right neighbor (2) is eating, causing her to wait(). If the scheduling of signals and wake-ups is unlucky, philosopher 1 could be indefinitely postponed.
- The Challenge: Preventing starvation requires additional mechanisms, such as serving hungry philosophers in FIFO order. The textbook leaves designing a starvation-free version as an exercise.
7.2 Synchronization within the Kernel¶
This section explores how real operating systems implement synchronization. Windows and Linux are used as examples because they represent different design philosophies. Their synchronization mechanisms have subtle but important differences.
7.2.1 Synchronization in Windows¶
Kernel-Level Synchronization (For OS Developers)¶
The Windows kernel itself is multithreaded and supports real-time applications and multiple processors (SMP). It uses different techniques depending on the hardware:
On Single-Processor Systems:
- Technique: Temporarily masking interrupts.
- How it works: Before accessing a global resource, the kernel disables interrupts from all interrupt handlers that could also access that resource.
- Why it works: On a single CPU, if interrupts are off, the currently running kernel thread cannot be preempted. This guarantees atomic access without needing a lock, as no other thread (including interrupt handlers) can run.
On Multiprocessor (SMP) Systems:
- Technique: Spinlocks.
- How it works: To protect short code segments accessing global data, the kernel uses spinlocks. A processor wanting to enter the critical section "spins" in a busy-wait loop until the lock is released.
- Critical Optimization: The kernel ensures that a thread holding a spinlock is never preempted. This is vital for efficiency. If a thread were preempted while holding a spinlock, other CPUs could waste enormous time spinning, waiting for a thread that isn't even running. This policy minimizes spin time.
User-Level Synchronization: Dispatcher Objects (For Application Developers)¶
For synchronization between threads in user applications, Windows provides a family of dispatcher objects. These are kernel objects that threads can wait on. The common types are:
- Mutex Locks: For mutual exclusion (as you've learned).
- Semaphores: Counting semaphores, as defined in Section 6.6.
- Events: Functionally similar to condition variables. They allow a thread to wait until some "event" or condition is signaled by another thread.
- Timers: Allow a thread to sleep for a specified time or be notified at regular intervals.
The Signaled/Nonsignaled State Model (Go to Figure 7.8)¶
All dispatcher objects share a common abstraction: they exist in one of two states.
- Signaled State: The object is available. A thread trying to acquire it will not block.
- Nonsignaled State: The object is unavailable. A thread trying to acquire it will block and be placed in a wait queue.
State Transition Example - Mutex Lock (Figure 7.8):
- Initial State: Mutex is Signaled (free, available).
- Thread Acquires Lock: When a thread successfully acquires the mutex, its state changes to Nonsignaled.
- Thread Releases Lock: When the owning thread releases the mutex, its state returns to Signaled.
Interaction Between Object State and Thread State¶
This model directly links the dispatcher object's state to the thread's scheduling state:
- A thread that executes a wait() on a nonsignaled object has its state changed from READY to WAITING.
- The thread is then placed in a waiting queue specifically for that object.
- When the object's state transitions to signaled, the Windows kernel checks its wait queue.
- One or more threads are moved from the WAITING state to the READY state and can resume execution.
How Many Threads Wake Up? It depends on the object type:
- Mutex: Only one thread is woken up (because only one can own it). This ensures mutual exclusion.
- Event: All waiting threads are typically woken up. This is useful for broadcast notifications (like a condition variable's signal_all()).
Efficiency Feature: Critical-Section Objects¶
Windows provides a specialized, lightweight mutual exclusion primitive for user-mode programming: the critical-section object.
- What it is: A user-mode mutex. It performs acquisition and release without entering the kernel when there is no contention (i.e., the lock is free). This is very fast.
- On Multiprocessor Systems: It uses an adaptive strategy:
- First, it spins. The acquiring thread uses a spinlock for a short, predetermined duration, hoping the owning thread on another CPU will release it soon.
- If spinning fails, it allocates a real kernel mutex dispatcher object and yields the CPU, blocking the thread properly.
- Why it's efficient: The kernel object (with its associated overhead) is allocated only when needed (during contention). Since most locks are uncontended most of the time, this provides a massive performance win.
Summary: Windows provides a layered approach: low-level interrupt masking and spinlocks for the kernel itself, and flexible dispatcher objects with a signaled/nonsignaled model for user applications, topped with an optimized critical-section for high-performance user-mode mutual exclusion.
7.2.2 Synchronization in Linux¶
Evolution of the Linux Kernel¶
- Pre-2.6: The Linux kernel was nonpreemptive. A process running in kernel mode could not be interrupted, even if a higher-priority process became ready.
- Version 2.6 and later: The Linux kernel became fully preemptive. A task executing in the kernel can be preempted by a higher-priority task. This change made synchronization more critical and complex within the kernel itself.
Linux Kernel Synchronization Mechanisms¶
Linux offers a toolkit of synchronization primitives, each suited for specific scenarios.
1. Atomic Integers
What it is: The simplest synchronization tool. It's an opaque data type, atomic_t, which ensures that all read-modify-write operations on it are indivisible (atomic). The hardware provides special instructions for this.
How to use it:
atomic_t counter; // Declare an atomic integer
int value;
Example Operations & Effects:
- atomic_set(&counter, 5); → counter = 5
- atomic_add(10, &counter); → counter = counter + 10
- atomic_sub(4, &counter); → counter = counter - 4
- atomic_inc(&counter); → counter = counter + 1
- value = atomic_read(&counter); → value = 12 (reads the value)
Advantage: Extremely efficient. No lock overhead. Operations are performed directly using atomic CPU instructions.
Limitation: Use is narrow. Only protects a single integer variable. Cannot protect complex critical sections involving multiple data items.
2. Mutex Locks
- What it is: The standard sleep lock for protecting critical sections in the kernel. If the lock is unavailable, the calling task sleeps (blocks) and is woken up later.
- How to use it:
mutex_lock(&lock);   // Acquire the mutex; sleep if held by another task
...                  // Critical section
mutex_unlock(&lock); // Release the mutex; wake up a waiter
- Important Property: Linux mutexes are nonrecursive. A thread cannot lock a mutex it already owns: a second mutex_lock() call on the same mutex by the same thread will deadlock the thread.
3. Spinlocks
- What it is: The fundamental busy-wait lock for Symmetric Multi-Processor (SMP) systems. Designed to be held for very short durations (e.g., updating a list pointer).
- How to use it:
spin_lock(&spinlock);   // Acquire the spinlock; spin if held
...                     // Very short critical section
spin_unlock(&spinlock); // Release the spinlock
- Important Property: Also nonrecursive.
Key Linux Strategy: Unifying SMP and Single-Processor Systems¶
Linux cleverly abstracts the difference between single-CPU and multi-CPU systems for locking.
On Multiple-Processor (SMP) Machines:
- Use spinlocks directly. Each processor spins on its own CPU cache while waiting.
On Single-Processor (Uniprocessor) Machines:
- Spinlocks are meaningless and wasteful (no other CPU to release the lock). Instead, Linux replaces them with disabling/enabling kernel preemption.
- Mapping:
- spin_lock() → preempt_disable() (prevent the kernel task from being preempted)
- spin_unlock() → preempt_enable() (allow preemption again)
- Why this works: On a single CPU, if kernel preemption is disabled, the current thread cannot be interrupted. This guarantees exclusive access, achieving the same effect as a spinlock but without busy-waiting.
The preempt_count Mechanism: Tracking Lock Holders¶
To safely manage preemption, each task has a thread-info structure with a preempt_count field.
- Purpose: This counter indicates how many locks the task currently holds.
- Rules:
- When a lock (spinlock, mutex, etc.) is acquired, preempt_count is incremented.
- When the lock is released, preempt_count is decremented.
- The kernel may safely preempt the currently running task only if its preempt_count == 0. If preempt_count > 0, the task holds a lock, and preempting it could lead to deadlock (another task might spin forever waiting for the lock held by the preempted task).
This system ensures that a task holding a lock is not preempted, respecting the policy mentioned for Windows spinlocks and maintaining efficiency.
Guideline: Choosing the Right Lock¶
- Short Duration Hold (a few instructions): Use spinlocks (SMP) or preemption control (UP).
- Longer Duration Hold (I/O, complex operations): Use semaphores or mutex locks (which put the waiting task to sleep).
- Reader-Writer Variants: Both spinlocks and semaphores have reader-writer versions (rwlock_t, rw_semaphore) for scenarios with many readers and few writers.
Summary: Linux provides a hierarchy from lightweight atomic integers, to short-hold spinlocks (abstracted for UP systems), to sleep-based mutexes/semaphores. The preempt_count mechanism elegantly enforces the non-preemption rule for lock holders across all these primitives.
7.3 POSIX Synchronization¶
The previous section covered tools for kernel developers. POSIX (Portable Operating System Interface) synchronization, however, is a user-level API available to application programmers. It is a standard interface, not tied to any specific OS kernel (though it's implemented using that OS's underlying tools).
This section covers three primitives in the Pthreads (POSIX Threads) and POSIX API:
- Mutex Locks
- Semaphores
- Condition Variables
These are the primary tools for thread creation and synchronization on UNIX, Linux, macOS, and many other systems.
7.3.1 POSIX Mutex Locks¶
The mutex lock is the most basic synchronization tool in Pthreads. It's used to protect critical sections—a thread acquires the lock before entering, and releases it upon leaving.
Data Type and Initialization¶
The pthread_mutex_t data type represents a mutex lock.
Creating/Initializing a Mutex:
#include <pthread.h>
pthread_mutex_t mutex; // Declare a mutex variable
/* Create and initialize the mutex lock with default attributes */
pthread_mutex_init(&mutex, NULL);
- First Parameter (&mutex): A pointer to the pthread_mutex_t variable.
- Second Parameter (NULL): Specifies attributes for the mutex (such as its type: normal, recursive, error-checking). Passing NULL uses the default attributes (typically a non-recursive, fast mutex).
Acquiring and Releasing the Lock¶
The core operations are pthread_mutex_lock() and pthread_mutex_unlock().
Basic Usage Pattern:
/* Acquire the mutex lock (block if unavailable) */
pthread_mutex_lock(&mutex);
/* BEGIN CRITICAL SECTION */
... // Access shared data here
/* END CRITICAL SECTION */
/* Release the mutex lock */
pthread_mutex_unlock(&mutex);
Behavior:
- pthread_mutex_lock(&mutex): The calling thread attempts to acquire the mutex.
  - If the mutex is available, the function returns immediately, and the thread now owns the lock.
  - If the mutex is unavailable (held by another thread), the calling thread blocks (is put to sleep) until the lock's owner releases it.
- pthread_mutex_unlock(&mutex): The owning thread releases the mutex. If other threads are blocked waiting for it, one of them is awakened and becomes the new owner.
Error Handling¶
All Pthreads mutex functions return an integer value.
- Return value 0: indicates success.
- Return value > 0 (a nonzero error code): indicates failure. The value identifies the specific error (e.g., EINVAL for an invalid mutex, EDEADLK for a detected deadlock with error-checking mutexes).
Good practice is to check these return values in production code. Example:
int ret;
ret = pthread_mutex_lock(&mutex);
if (ret != 0) {
// Handle the error, e.g., fprintf(stderr, "Mutex lock failed: %s\n", strerror(ret));
}
Important Characteristics (Implied)¶
- Non-Recursive by Default: With default attributes (NULL), a thread attempting to lock a mutex it already owns will deadlock itself.
- Ownership: Only the thread that acquired (lock) the mutex can legally release (unlock) it.
- Sleep/Wakeup: Waiting threads are put to sleep, not made to spin. This is efficient for user-level locks that may be held for longer periods.
Summary: POSIX mutex locks provide a standardized, blocking mutual-exclusion mechanism for user-level threads. The pattern is: init() once, lock()/unlock() around critical sections, check return values for errors.
7.3.2 POSIX Semaphores¶
Introduction¶
While mutex locks are part of the core POSIX (Pthreads) standard, semaphores are provided by the POSIX SEM extension. This means they are widely available but technically optional. POSIX specifies two types of semaphores:
- Named Semaphores
- Unnamed Semaphores
They function identically in terms of wait/signal operations but differ in how they are created, identified, and shared. Both are supported in Linux (kernel 2.6+) and macOS.
7.3.2.1 POSIX Named Semaphores¶
Creation and Opening with sem_open()¶
Named semaphores are identified by a name string (like a filename). They are primarily used for synchronization between unrelated processes (not just threads of the same process).
#include <fcntl.h>     /* for the O_CREAT flag */
#include <semaphore.h>
sem_t *sem;
/* Create the semaphore and initialize it to 1 */
sem = sem_open("/SEM", O_CREAT, 0666, 1);
Function Parameters Explained:
- "/SEM": The name of the semaphore. It must begin with a forward slash (/). This name is used system-wide.
- O_CREAT: A flag indicating the semaphore should be created if it doesn't already exist. If it exists, sem_open() just opens it.
- 0666: The permissions for the semaphore (like file permissions). 0666 means read and write for owner, group, and others.
- 1: The initial value of the semaphore.
Return Value: sem_open() returns a pointer to a semaphore (sem_t *) on success, or SEM_FAILED on error.
Key Advantage: Inter-Process Communication (IPC)¶
Named semaphores have global visibility in the system (via the name). Any unrelated process can synchronize with another by simply calling sem_open() with the same semaphore name. The OS provides the same underlying semaphore object. This makes them ideal for coordinating separate programs.
Operations: sem_wait() and sem_post()¶
POSIX uses different names for the classic semaphore operations:
- sem_wait(sem): Equivalent to wait() or P(). Decrements the semaphore. If the value becomes negative, the calling thread/process blocks.
- sem_post(sem): Equivalent to signal() or V(). Increments the semaphore. If other threads/processes are blocked waiting, one is awakened.
Usage Pattern to Protect a Critical Section:
/* Acquire the semaphore (decrement) */
sem_wait(sem);
/* BEGIN CRITICAL SECTION */
... // Access shared resource (could be in shared memory)
/* END CRITICAL SECTION */
/* Release the semaphore (increment) */
sem_post(sem);
Important Considerations¶
- System Persistence: Named semaphores have kernel persistence. They continue to exist until explicitly removed with sem_unlink() or until the system reboots. This is why the name often resembles a file path: the object is managed by the OS.
- Cleanup: A process should call sem_close(sem) when done using the semaphore to release its local resources. The sem_unlink("/SEM") call removes the semaphore name and object from the system when no processes have it open.
- Initial Value: The initial value determines the semaphore's type:
  - 1 → binary semaphore (often used as a mutex).
  - N → counting semaphore (allows N concurrent accesses).
Summary: POSIX named semaphores are a powerful, system-wide synchronization primitive, ideal for coordinating between separate processes using a simple string name. Their API mirrors the classic semaphore operations with sem_wait() and sem_post().
7.3.2.2 POSIX Unnamed Semaphores¶
Creation and Initialization with sem_init()¶
Unnamed semaphores (also called memory-based semaphores) are created in memory allocated by the program, rather than by the kernel with a global name. They are initialized using the sem_init() function.
#include <semaphore.h>
sem_t sem; // Declare a semaphore variable (usually in shared memory or global data)
/* Create the semaphore and initialize it to 1 */
sem_init(&sem, 0, 1);
Function Parameters Explained:
- &sem: A pointer to the sem_t semaphore variable. This is the memory location where the semaphore state will be stored.
- 0: The pshared (process-shared) flag.
  - 0: The semaphore is private to the process that created it. It can be shared only among threads within the same process. This is the most common use case.
  - Nonzero (typically 1): The semaphore can be shared between different processes. For this to work, the sem_t variable must be placed in a region of shared memory that all cooperating processes can access. The semaphore operations will then work across process boundaries.
- 1: The initial value of the semaphore.
Key Difference from Named Semaphores: Scope and Lifetime¶
- Scope: Unnamed semaphores are identified by a memory address, not a system-wide name. They are only accessible to threads/processes that have a pointer to that memory location.
- Lifetime: The semaphore exists as long as the memory that contains it exists.
- If it's a global/static variable, it lasts the program's lifetime.
- If it's in shared memory, it lasts as long as that memory segment exists.
- If it's on the stack, it's destroyed when the function returns (this is usually an error).
- No Kernel Persistence: Unlike named semaphores, they are not tracked by the kernel after the creating process terminates (unless in persistent shared memory).
Operations: sem_wait() and sem_post()¶
The operations are identical to named semaphores, but you pass the address of the semaphore variable.
Usage Pattern for a Process-Local (Thread) Semaphore:
/* Acquire the semaphore (decrement) */
sem_wait(&sem);
/* BEGIN CRITICAL SECTION */
... // Access shared data between threads of this process
/* END CRITICAL SECTION */
/* Release the semaphore (increment) */
sem_post(&sem);
Error Handling¶
As with mutex functions, all POSIX semaphore functions return an integer:
- Return value 0: Success.
- Return value -1 (with errno set): Failure. The error code is stored in the global errno variable. You must check for -1 and then inspect errno (e.g., using perror() or strerror(errno)).
Example with error checking:
if (sem_init(&sem, 0, 1) == -1) {
perror("sem_init failed");
exit(EXIT_FAILURE);
}
Cleanup¶
When an unnamed semaphore is no longer needed, it should be destroyed using sem_destroy() to free any internal resources.
sem_destroy(&sem);
Important: sem_destroy() should only be called when no thread is blocked waiting on the semaphore, and no further operations will be performed on it.
When to Use Unnamed vs. Named Semaphores¶
Use Unnamed Semaphores (sem_init) when:
- Synchronizing threads within a single process (use pshared = 0).
- Synchronizing processes that share a memory region (use pshared = 1 and place the semaphore in shared memory).
- You want explicit control over the semaphore's memory location and lifetime.
Use Named Semaphores (sem_open) when:
- Synchronizing unrelated processes that do not share memory.
- You want the convenience of a system-wide name for easy reference.
- You need kernel persistence beyond the lifetime of the creating process.
Summary: POSIX unnamed semaphores are lightweight, memory-based semaphores ideal for intra-process thread synchronization or inter-process synchronization via shared memory. They follow the same sem_wait()/sem_post() semantics but are managed with sem_init() and sem_destroy().
7.3.3 POSIX Condition Variables¶
Introduction and Context¶
Condition variables in Pthreads behave like the classic condition variables from Section 6.7, but with a crucial implementation difference. In Section 6.7, condition variables were used inside a monitor, which automatically provided mutual exclusion.
Since Pthreads is used in C (a language without built-in monitors), mutual exclusion is not automatic. Therefore, in Pthreads, a condition variable must always be explicitly associated with a mutex lock to protect the shared data involved in the condition.
Data Type and Initialization¶
The pthread_cond_t data type represents a condition variable.
Creating/Initializing a Condition Variable and its Mutex:
pthread_mutex_t mutex;
pthread_cond_t cond_var;
/* Initialize the mutex with default attributes */
pthread_mutex_init(&mutex, NULL);
/* Initialize the condition variable with default attributes */
pthread_cond_init(&cond_var, NULL);
Waiting on a Condition: pthread_cond_wait()¶
The pattern for waiting is critical and must be followed exactly to avoid race conditions.
Correct Waiting Pattern:
pthread_mutex_lock(&mutex); // STEP 1: ACQUIRE the associated mutex
while (a != b) { // STEP 2: CHECK the condition UNDER THE LOCK
pthread_cond_wait(&cond_var, &mutex); // STEP 3: WAIT (atomically releases mutex)
}
// When loop exits, condition (a == b) is TRUE, and thread HOLDS the mutex again.
pthread_mutex_unlock(&mutex); // STEP 4: RELEASE the mutex
Step-by-Step Explanation of pthread_cond_wait(&cond_var, &mutex):
- The thread must hold mutex before calling pthread_cond_wait(). This protects the evaluation of the condition (a != b).
- The function atomically (as one indivisible step):
  - Releases the mutex, allowing other threads to acquire it and change the shared data (a and b).
  - Puts the calling thread to sleep, queuing it on the condition variable cond_var.
- When the thread is later signaled and wakes up, the function re-acquires the mutex before returning to the caller.
- The thread must then re-check the condition in a while loop. NEVER use an if statement. This is because of:
  - Spurious Wakeups: The thread can wake up without being signaled (allowed by some implementations for performance).
  - Multiple Waiters: When pthread_cond_signal() wakes one thread, another thread might have changed the condition before the woken thread runs.
Signaling a Condition: pthread_cond_signal()¶
A thread that changes the shared data and potentially makes the condition true must signal.
Correct Signaling Pattern:
pthread_mutex_lock(&mutex); // STEP A: ACQUIRE the SAME mutex
a = b; // STEP B: Modify shared data (makes condition true)
pthread_cond_signal(&cond_var); // STEP C: Signal ONE waiting thread
pthread_mutex_unlock(&mutex); // STEP D: RELEASE the mutex
Important Behavior of pthread_cond_signal():
- It does NOT release the mutex. The signaling thread continues to hold the lock.
- The signaled thread cannot wake up and return from pthread_cond_wait() until the signaling thread releases the mutex (via pthread_mutex_unlock()).
- Only one waiting thread is awakened. Use pthread_cond_broadcast(&cond_var) to wake all waiting threads.
The Wake-Up Sequence¶
- The signaling thread calls pthread_cond_signal().
- The signaling thread calls pthread_mutex_unlock().
- The mutex becomes available.
- The signaled thread (which was blocked in pthread_cond_wait()) can now re-acquire the mutex and return from the call. It then re-checks the loop condition.
Key Rules to Remember¶
- Always associate a condition variable with a specific mutex protecting the shared data.
- Always hold the mutex when evaluating the condition (while (condition)).
- Always use a while loop, never an if, when waiting.
- The mutex ensures that checking the condition and going to sleep is atomic, closing a crucial timing window where a signal could be missed.
7.4 Synchronization in Java¶
Java has provided built-in support for thread synchronization since its beginning. This section covers Java's original mechanism (monitors) and three important mechanisms introduced in Java 5: reentrant locks, semaphores, and condition variables. The Java API includes many other concurrency features (like atomic variables and CAS operations), but this section focuses on the most common tools.
7.4.1 Java Monitors¶
The Monitor-Like Mechanism¶
Java provides a concurrency mechanism that is functionally similar to a monitor. This is illustrated using the BoundedBuffer class (see Figure 7.9), which solves the bounded-buffer problem. The producer calls the insert() method and the consumer calls the remove() method.
The Intrinsic Lock¶
Every Java object has a single intrinsic lock (also called a monitor lock). This lock is the foundation of Java's synchronization.
Declaring a Synchronized Method:
When a method is declared with the synchronized keyword, a thread must own the object's intrinsic lock before it can execute that method.
- In the `BoundedBuffer` class (Figure 7.9), both `insert()` and `remove()` are declared as `synchronized` methods.
- This means a thread must acquire the lock on the `BoundedBuffer` object instance before inserting or removing an item.
The BoundedBuffer Class Skeleton (Figure 7.9 Details)¶
public class BoundedBuffer<E> {
private static final int BUFFER_SIZE = 5;
private int count, in, out;
private E[] buffer;
public BoundedBuffer() {
count = 0; in = 0; out = 0;
buffer = (E[]) new Object[BUFFER_SIZE];
}
/* Producers call this method */
public synchronized void insert(E item) {
// Implementation shown later (Figure 7.11)
}
/* Consumers call this method */
public synchronized E remove() {
// Implementation shown later (Figure 7.11)
}
}
Important Points:
- The `synchronized` keyword on both methods uses the same intrinsic lock (the lock of the `BoundedBuffer` object).
- This guarantees that only one thread can be inside either `insert()` or `remove()` at any time, providing mutual exclusion for the buffer's internal state (`count`, `in`, `out`, `buffer`).
- However, mutual exclusion alone is not enough for the bounded-buffer problem. We also need a way for threads to wait when the buffer is full (producer) or empty (consumer). This requires condition variables, which Java provides via `wait()` and `notify()` methods (to be covered in the next section).
Key Characteristics of Java Intrinsic Locks:¶
- Reentrant: A thread that already owns a lock can acquire it again (e.g., via recursive calls). The lock has an internal hold count.
- Automatic Release: The lock is released when the synchronized method/block exits, even if due to an exception.
- No Explicit Lock Object: The lock is implicitly tied to the object instance.
- One Lock Per Object: This can be a limitation if you need different locks for different data within the same object (solved by using separate lock objects or synchronized blocks).
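Reentrancy can be seen in a small sketch (a hypothetical class, not from the text): a `synchronized` method that calls itself re-acquires the lock it already holds instead of deadlocking.

```java
public class ReentrancyDemo {
    private int calls = 0;

    // The first call acquires this object's intrinsic lock; each recursive
    // call re-acquires it (the hold count grows) rather than deadlocking.
    public synchronized int countDown(int n) {
        calls++;                  // safe: the lock is held at every level
        if (n <= 0) {
            return calls;         // lock fully released once all levels return
        }
        return countDown(n - 1);  // reentrant acquisition of the same lock
    }
}
```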
Lock Acquisition, Entry Set, and Scheduling¶
How it works step-by-step:
- Available Lock: If the lock is free when a thread calls a `synchronized` method, the thread immediately becomes the owner and enters the method.
- Lock Already Held: If another thread already owns the lock, the calling thread blocks (stops execution) and is placed in a waiting area called the entry set for that object's lock.
- Lock Release: When the owning thread exits the synchronized method, it automatically releases the lock.
- Selecting the Next Owner: If the entry set is not empty, the Java Virtual Machine (JVM) selects one thread from the set to become the new lock owner. The Java specification says this selection is arbitrary (not defined), but in practice, most JVMs use a FIFO (First-In, First-Out) policy.
- The selected thread then owns the lock and proceeds into its synchronized method.
Visualizing the Entry Set (see Figure 7.10): The entry set operates like a queue (typically FIFO) of threads waiting to acquire the object's lock. When the lock is released, the JVM picks a thread from the front of this queue.
The Wait Set¶
In addition to the entry set (threads waiting for the lock), every Java object has an associated wait set—a set of threads that have temporarily released the lock because a specific condition was not met. This set is initially empty.
When is the Wait Set Used? A thread owning the lock inside a synchronized method may find it cannot proceed because a necessary condition is false. For example:
- A producer in `insert()` finds the buffer full.
- A consumer in `remove()` finds the buffer empty.
In such cases, the thread must release the lock and wait until the condition becomes true. This is done using the wait() method.
What Happens When wait() is Called¶
When a thread calls wait(), three atomic steps occur:
- Releases the lock for the object.
- Changes its state to blocked.
- Places itself in the object's wait set.
Complete Walkthrough Using Figure 7.11¶
/* Producers call this method */
public synchronized void insert(E item) {
while (count == BUFFER_SIZE) { // MUST USE WHILE LOOP
try {
wait();
}
catch (InterruptedException ie) { }
}
buffer[in] = item;
in = (in + 1) % BUFFER_SIZE;
count++;
notify(); // Notify a waiting consumer
}
/* Consumers call this method */
public synchronized E remove() {
E item;
while (count == 0) { // MUST USE WHILE LOOP
try {
wait();
}
catch (InterruptedException ie) { }
}
item = buffer[out];
out = (out + 1) % BUFFER_SIZE;
count--;
notify(); // Notify a waiting producer
return item;
}
Assume the buffer is full and the object's lock is free.
- Producer calls `insert()`:
  - Acquires the lock (available).
  - Enters the method.
  - `while (count == BUFFER_SIZE)` is true.
  - Calls `wait()`.
  - Releases lock, blocks, enters wait set.
- Consumer eventually calls `remove()`:
  - Acquires the now-available lock.
  - Enters the method.
  - Takes an item (`count--`).
  - Calls `notify()` (consumer still holds the lock).
  - Producer is moved from wait set to entry set, state becomes runnable.
- Consumer exits `remove()`:
  - Releases the lock.
- Producer (now in entry set):
  - Competes for and reacquires the lock.
  - Returns from the `wait()` call.
  - Re-evaluates the `while` loop condition (`count == BUFFER_SIZE`). Now false (since the consumer removed an item).
  - Proceeds to insert its item, increments `count`, calls `notify()` (in case a consumer is waiting), and exits, releasing the lock.
Visualizing Entry and Wait Sets (see Figure 7.12)¶
The figure illustrates the structure:
- Lock Owner: The single thread currently executing inside a synchronized method.
- Entry Set: Threads waiting to acquire the lock.
- Wait Set: Threads that have voluntarily released the lock via `wait()` and are waiting to be notified of a condition change.
Signaling with notify()¶
When a thread changes the state in a way that might satisfy a waiting thread's condition, it calls notify() (while still holding the lock).
What notify() does:
- Selects an arbitrary thread T from the object's wait set (in practice, most JVMs use FIFO).
- Moves T from the wait set to the entry set.
- Changes T's state from blocked to runnable.
Thread T is now eligible to compete for the lock again. Once T reacquires the lock, it returns from the wait() call and re-checks the condition in the while loop.
Critical Details from the Code (Figure 7.11)¶
- Always use `wait()` in a `while` loop, never an `if`. This is mandatory because:
  - Spurious wakeups can occur (the thread can wake up without a `notify()`).
  - When multiple threads are waiting for the same condition, another thread might "steal" the resource between being woken and reacquiring the lock. The `while` loop re-tests the condition after waking.
- `notify()` vs. `notifyAll()`: `notify()` wakes one waiting thread. `notifyAll()` wakes all threads in the wait set. Use `notifyAll()` when multiple waiting threads might all be able to proceed after the state change, or when different threads are waiting for different conditions.
- `InterruptedException`: `wait()` can throw this if the thread is interrupted. The example code catches and ignores it for simplicity, but real applications should handle it appropriately (often by restoring the interrupt status).
- Ignored `notify()`: If the wait set is empty, `notify()` has no effect (it is ignored).
Historical Context¶
The synchronized, wait(), and notify() mechanisms are Java's original concurrency tools. While still valid and widely used, later Java versions introduced more flexible and robust concurrency utilities in the java.util.concurrent package, which we will examine next.
Block Synchronization for Finer-Grained Control¶
The scope of a lock is the time between its acquisition and release. Synchronizing an entire method can create an overly large scope if only a small part of the method manipulates shared data. This reduces concurrency.
Java therefore supports block synchronization, allowing you to synchronize only the critical section.
Example of Block Synchronization:
public void someMethod() {
/* non-critical section */
// Can run concurrently with other threads
synchronized(this) { // Acquire lock on 'this' object
/* critical section */
// Only one thread at a time can execute here
} // Lock is automatically released here
/* remainder section */
// Can run concurrently again
}
- Only the code inside the `synchronized` block requires ownership of the lock (`this` in this example, but it can be any object).
- This design minimizes lock scope, improving potential parallelism by allowing the non-critical sections of multiple threads to proceed concurrently.
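Because any object can serve as a lock, block synchronization also works around the one-lock-per-object limitation noted earlier: independent data can be guarded by separate private lock objects. A minimal sketch (hypothetical class, not from the text):

```java
public class TwoCounters {
    // Separate lock objects let threads update the two counters
    // concurrently instead of contending for one shared lock.
    private final Object hitLock = new Object();
    private final Object missLock = new Object();
    private int hits, misses;

    public void recordHit() {
        synchronized (hitLock) {   // critical section for 'hits' only
            hits++;
        }
    }

    public void recordMiss() {
        synchronized (missLock) {  // independent of hitLock
            misses++;
        }
    }

    public int getHits()   { synchronized (hitLock)  { return hits;   } }
    public int getMisses() { synchronized (missLock) { return misses; } }
}
```

A thread in `recordHit()` never blocks a thread in `recordMiss()`, since they acquire different locks.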
7.4.2 Reentrant Locks¶
Introduction and Basic Functionality¶
The ReentrantLock is the simplest explicit locking mechanism in the Java API. It serves a similar purpose to the synchronized statement: it provides mutual exclusion and is owned by a single thread at a time.
Key Similarity to synchronized:
- Mutually exclusive access to a shared resource.
- Reentrant: A thread that already owns the lock can acquire it again without deadlocking. The lock maintains a hold count.
Key Advantages over synchronized:
- Fairness Policy: Can be created as a fair lock (via constructor `new ReentrantLock(true)`). A fair lock favors granting the lock to the longest-waiting thread, reducing the chance of starvation. (The JVM specification does not require fairness for intrinsic locks' wait/entry sets.)
- Explicit Control: Lock acquisition and release are explicit method calls, allowing more complex patterns.
- Additional Features: `tryLock()` (non-blocking attempt), `lockInterruptibly()` (acquire with interruptible waiting), etc.
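The non-blocking `tryLock()` attempt can be sketched as follows (hypothetical class, not from the text); note that `unlock()` is only reached after `tryLock()` has returned true:

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class TryLockDemo {
    private final Lock key = new ReentrantLock();
    private int counter = 0;

    // Attempts the update without blocking; returns false if the lock is busy.
    public boolean tryUpdate() {
        if (!key.tryLock()) {   // non-blocking attempt to acquire
            return false;       // caller can retry later or do other work
        }
        try {
            counter++;          // critical section
            return true;
        } finally {
            key.unlock();       // released only after a successful tryLock()
        }
    }

    public int getCounter() {
        key.lock();
        try { return counter; } finally { key.unlock(); }
    }
}
```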
Basic Usage Pattern¶
A ReentrantLock implements the Lock interface. The critical and mandatory usage pattern is as follows:
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;
Lock key = new ReentrantLock(); // Create a non-fair lock
key.lock(); // Acquire the lock (blocks if unavailable)
try {
/* CRITICAL SECTION */
} finally {
key.unlock(); // Release lock in finally block
}
The Importance of the try-finally Pattern¶
The textbook provides a crucial explanation of why this pattern is necessary and why lock() is placed outside the try block.
Why unlock() must be in finally:
The lock must be released after the critical section completes, whether it completes normally or if an exception is thrown inside the critical section. The finally block guarantees execution.
Why lock() is outside the try block:
Consider this incorrect pattern:
try {
key.lock(); // Acquire lock inside try
// critical section
} finally {
key.unlock(); // Always unlock
}
Problem: If an unchecked exception (e.g., OutOfMemoryError) occurs during the lock() call itself (before acquisition), the finally block will still execute unlock(). Since the lock was never acquired, unlock() throws an IllegalMonitorStateException. This masks/hides the original exception (OutOfMemoryError), making debugging extremely difficult.
Correct pattern ensures:
- If `lock()` succeeds, the lock is definitely acquired.
- If `lock()` fails with an unchecked exception, `unlock()` is not called, avoiding a misleading secondary exception.
ReentrantReadWriteLock¶
The standard ReentrantLock provides exclusive access, which can be overly restrictive when many threads only read shared data (as in the Readers-Writers Problem, Section 7.1.2).
Java provides ReentrantReadWriteLock to address this:
- It contains two locks: a read lock and a write lock.
- Multiple threads can hold the read lock concurrently (as long as no thread holds the write lock).
- Only one thread can hold the write lock, and it must have exclusive access (no other readers or writers).
Usage:
import java.util.concurrent.locks.ReentrantReadWriteLock;
ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();
// For reading:
rwLock.readLock().lock();
try {
// Multiple readers can be here
} finally { rwLock.readLock().unlock(); }
// For writing:
rwLock.writeLock().lock();
try {
// Only one writer can be here, exclusive
} finally { rwLock.writeLock().unlock(); }
Benefit: Increases concurrency for read-heavy workloads compared to a simple mutex.
7.4.3 Semaphores¶
Introduction and Construction¶
Java provides a counting semaphore via java.util.concurrent.Semaphore. The constructor sets the initial permit count:
Semaphore sem = new Semaphore(1); // Binary semaphore (mutex)
- A negative initial value is allowed (though unusual), meaning initial acquires must wait until enough `release()` calls occur.
Basic Usage for Mutual Exclusion¶
The standard pattern for using a semaphore as a mutex:
Semaphore sem = new Semaphore(1);
try {
    sem.acquire();     // wait() or P() operation - can throw InterruptedException
    try {
        /* critical section */
    } finally {
        sem.release(); // signal() or V() operation - runs only after a successful acquire()
    }
} catch (InterruptedException ie) {
    // Handle interruption (e.g., restore interrupted status)
}

Note the nested try: if acquire() is interrupted before a permit is obtained, release() is never called, so the permit count is not corrupted.
Key Details and Differences from Locks¶
- Interruptible: `acquire()` throws `InterruptedException` if the waiting thread is interrupted. This provides a way to cancel wait operations.
- No Ownership Concept: Unlike `ReentrantLock`, a semaphore has no notion of an owning thread. Any thread can call `release()` on a semaphore, not just the thread that called `acquire()`. (This is true for classic semaphores but must be used carefully.)
- Multiple Permits: The constructor can take any integer `N`, allowing up to `N` threads in the critical section concurrently.
- Try Acquire: `tryAcquire()` attempts to get a permit without blocking; returns `true`/`false`.
- Fairness: Can be created with fairness setting: `new Semaphore(1, true)`.
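The multiple-permits case can be sketched with a hypothetical gate that admits at most three threads at once (illustrative class, not from the text):

```java
import java.util.concurrent.Semaphore;

public class ConnectionPoolGate {
    // At most 3 threads may hold a permit at the same time; 'true'
    // requests fair (longest-waiting-first) ordering.
    private final Semaphore permits = new Semaphore(3, true);

    public void useResource(Runnable work) throws InterruptedException {
        permits.acquire();      // blocks when all 3 permits are taken
        try {
            work.run();         // up to 3 threads execute here concurrently
        } finally {
            permits.release();  // return the permit even on exception
        }
    }

    public int available() {
        return permits.availablePermits();
    }
}
```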
Critical Requirement: Release in Finally¶
Just as with ReentrantLock.unlock(), the sem.release() must be in a finally block. This ensures the permit is always returned to the semaphore, even if the critical section throws an exception. Failure to do so could permanently reduce the number of available permits, eventually stalling the system.
7.4.4 Condition Variables¶
Introduction and Creation¶
Condition variables in the Java API (java.util.concurrent.locks.Condition) provide functionality similar to wait() and notify(), but with greater flexibility. They must be associated with an explicit ReentrantLock (not an intrinsic lock).
Creating a Condition Variable:
Lock key = new ReentrantLock();
Condition condVar = key.newCondition(); // Creates condition variable bound to 'key'
- `await()` corresponds to `wait()`
- `signal()` corresponds to `notify()`
- `signalAll()` corresponds to `notifyAll()`
Key Advantage: Multiple Condition Variables per Lock¶
This is the primary improvement over Java's original monitor design.
- Original Java Monitor: Each object has only one unnamed condition variable (accessed via `wait()`/`notify()`). When a thread is awakened by `notify()`, it doesn't know why; it must re-check its specific condition among possibly many.
- `Condition` Objects: You can create multiple `Condition` objects from one `ReentrantLock`. Each can represent a different logical condition. Threads can wait on a specific condition and be signaled only when that condition becomes true.
Example: Thread Turn-Taking (see Figure 7.13)¶
The example shows 5 threads (0-4) that must take turns based on a shared variable turn.
Setup:
Lock lock = new ReentrantLock();
Condition[] condVars = new Condition[5];
for (int i = 0; i < 5; i++) {
condVars[i] = lock.newCondition(); // Each thread gets its own condition
}
The doWork() Method Logic:
public void doWork(int threadNumber) {
lock.lock(); // Acquire the mutual exclusion lock
try {
// Wait if it's not this thread's turn
if (threadNumber != turn) {
condVars[threadNumber].await(); // Wait on THIS thread's condition
}
/* DO THE WORK (Critical Section) */
// Update turn to next thread
turn = (turn + 1) % 5;
// Signal ONLY the thread whose turn is next
condVars[turn].signal();
} catch (InterruptedException ie) {
// Handle interruption
} finally {
lock.unlock(); // Always release lock
}
}
How It Works Step-by-Step¶
- Lock Acquisition: Thread calls `lock.lock()`.
- Condition Check: If it's not the thread's turn (`threadNumber != turn`), it calls `await()` on its specific condition variable. `await()` atomically releases the associated lock, blocks the thread, and places it in the wait set for that specific `Condition`.
- Work Execution: The thread whose turn it is proceeds with its work.
- Turn Update & Signal: After the work, the thread updates `turn` to the next thread number and calls `signal()` on that next thread's condition variable. `signal()` wakes one thread waiting on that specific condition; the signaling thread still holds the lock.
- Lock Release & Wake-Up: The signaling thread exits the `try` block and executes `finally { lock.unlock(); }`. This releases the lock, allowing the signaled thread to re-acquire the lock and return from `await()`.
- Resumption: The awakened thread (whose turn it now is) checks the condition again (`if (threadNumber != turn)`). Now the condition is false, so it proceeds to do its work.
Important Technical Details¶
- No `synchronized` Keyword: Mutual exclusion comes from the `ReentrantLock`, not `synchronized`.
- `await()` Releases Lock: Just like `wait()`, `await()` atomically releases the associated lock before blocking.
- `signal()` Does NOT Release Lock: The signaling thread keeps the lock until it explicitly calls `unlock()`.
- Interruptible: `await()` throws `InterruptedException`.
- Always Use `try`-`finally`: The lock must be released in `finally`.
- Condition Predicate: The example uses an `if` statement because each thread waits on a unique condition (its own turn). However, for shared conditions where multiple threads could be waiting for the same logical condition, you must use a `while` loop to re-check after waking.
Comparison with Original Java Monitors¶
| Feature | Original Monitor (`synchronized` + `wait`/`notify`) | `ReentrantLock` + `Condition` |
|---|---|---|
| Condition Variables | One unnamed condition per object | Multiple named conditions |
| Lock Fairness | Not guaranteed | Can specify fair locking |
| Try Lock | No | `tryLock()` available |
| Interruptible Wait | Yes (via `InterruptedException`) | Yes |
| Nested Locking | Automatic reentrancy | Automatic reentrancy |
Summary: Java's Condition interface solves the limitation of single-condition waiting by allowing multiple explicit condition variables per lock, enabling precise thread signaling. The turn-taking example demonstrates how each thread can wait on and be signaled via its own dedicated condition, eliminating the "thundering herd" and unnecessary wake-ups of a single notify().
7.5 Alternative Approaches¶
The rise of multicore systems has increased the need for concurrent applications, but traditional synchronization (mutexes, semaphores, monitors) becomes harder to manage as core counts grow—increasing risks of race conditions and deadlock. This section explores newer language and hardware features designed to make concurrent programming safer and more scalable.
7.5.1 Transactional Memory¶
Concept and Origin¶
Transactional Memory (TM) borrows the transaction concept from database systems and applies it to memory operations. A memory transaction is a sequence of read-write operations that are atomic: either all complete successfully (commit) or none take effect (abort and roll back).
The Problem with Traditional Locking¶
Consider a typical update() function using a mutex:
void update() {
acquire(); // Programmer manages lock acquisition
/* modify shared data */
release(); // Programmer manages lock release
}
Problems:
- Deadlock Risk: Incorrect lock ordering can cause deadlock.
- Scalability Issues: High contention as thread count increases—threads spend more time waiting for locks.
- Complexity: Programmer must correctly identify critical sections and choose appropriate locks (e.g., reader-writer locks).
Transactional Memory Solution¶
A programming language can add an atomic{S} construct. Operations inside block S execute as a single transaction.
Rewritten update() function:
void update() {
atomic {
/* modify shared data */ // System guarantees atomicity
}
}
Advantages:
- Automatic Atomicity: The transactional memory system (not the programmer) guarantees atomic execution.
- No Deadlock: Since there are no explicit locks, deadlock from lock acquisition order is impossible.
- Automatic Concurrency Detection: The TM system can automatically identify which operations can run concurrently (e.g., concurrent reads) without the programmer manually implementing reader-writer locks.
- Simpler Code: The programmer declares what should be atomic, not how to make it atomic.
Implementation Approaches¶
Software Transactional Memory (STM):
- Implemented entirely in software (compiler and libraries).
- The compiler inserts instrumentation code around transaction blocks to track reads/writes and manage conflicts.
- Advantage: Works on existing hardware.
- Disadvantage: Higher runtime overhead due to instrumentation.
Hardware Transactional Memory (HTM):
- Uses hardware support (modified CPU caches and cache coherency protocols).
- Leverages the cache hierarchy to track memory accesses within a transaction and detect conflicts between cores.
- Advantage: Lower overhead (no instrumentation), faster conflict detection.
- Disadvantage: Requires new hardware (modified caches and coherence protocols).
How It Works (Conceptual)¶
- A thread starts a transaction and executes reads/writes to shared data.
- These writes are initially speculative (held in a private buffer or cache).
- At the end:
- If no conflicts (no other thread wrote to same locations), the transaction commits: speculative writes become visible to all.
- If a conflict is detected (another thread wrote to overlapping data), the transaction aborts: speculative writes are discarded, and the thread may retry.
Current Status and Challenges¶
Transactional memory has been researched for years but saw renewed interest with multicore proliferation. Challenges include:
- Integration with existing code and I/O operations.
- Performance of STM vs. HTM.
- Hardware support required for HTM (available in some modern CPUs like Intel TSX, but not universal).
7.5.2 OpenMP¶
Recap: OpenMP Basics¶
As covered in Section 4.5.2, OpenMP is an API for shared-memory parallel programming. It uses compiler directives (like #pragma omp parallel) to mark parallel regions, where the OpenMP runtime automatically creates a team of threads (typically equal to the number of CPU cores). The key advantage: thread management is handled by the library, not the programmer.
Synchronization in OpenMP: The critical Directive¶
OpenMP provides the #pragma omp critical directive to protect critical sections. Code inside the block following this directive becomes a mutually exclusive region—only one thread can execute it at a time.
Example: Race Condition and Fix¶
Problematic Code (Race Condition):
int counter = 0; // Shared variable
void update(int value) {
counter += value; // NOT thread-safe! Read-modify-write race.
}
If update() is called from within a parallel region (e.g., inside #pragma omp parallel), multiple threads may read, modify, and write counter concurrently, causing lost updates.
Solution with critical Directive:
void update(int value) {
#pragma omp critical
{
counter += value; // Only one thread executes this block at a time
}
}
How the critical Directive Works¶
- It behaves like a binary semaphore or mutex lock.
- When a thread encounters `#pragma omp critical`, it attempts to enter the critical section.
- If no other thread is inside any unnamed critical section, it proceeds.
- If another thread is inside any unnamed critical section, the calling thread blocks until that thread exits its critical section.
- Upon exiting the block, the thread releases the "lock," allowing a waiting thread to enter.
Named Critical Sections¶
For finer-grained locking (to avoid unnecessary serialization), you can assign names to critical sections.
#pragma omp critical(name1)
{
// Accesses shared data A
}
#pragma omp critical(name2)
{
// Accesses shared data B
}
- Rule: Only one thread can be active in a critical section with the same name at a time.
- Consequence: A thread can be in `critical(name1)` while another is in `critical(name2)`, allowing concurrent access to different data. This increases parallelism compared to a single unnamed critical section.
Advantages and Disadvantages¶
Advantages:
- Easier to Use: Simpler syntax than manual mutex lock/unlock calls. Less boilerplate code.
- Integrated: Part of a unified parallel programming model (data parallelism + synchronization).
- Portable: Works across platforms with OpenMP support.
Disadvantages:
- Programmer Responsibility: The developer must still identify race conditions and correctly place `critical` directives. OpenMP doesn't automatically detect shared data hazards.
- Deadlock Possible: Just like mutexes, if you have multiple critical sections and threads acquire them in different orders, deadlock can occur.
  - Example: Thread 1 enters `critical(A)` then tries for `critical(B)`. Thread 2 enters `critical(B)` then tries for `critical(A)`. Deadlock.
- Performance: Overuse of `critical` (especially unnamed) can serialize too much code, reducing parallelism. Named sections help but require careful design.
Relation to Other Synchronization¶
The critical directive is OpenMP's high-level mutual exclusion primitive. OpenMP also provides:
- `atomic` directive: For single-statement atomic operations (like `counter += value`), which can be more efficient than `critical` for simple operations.
- Locks API: Lower-level `omp_lock_t` with `omp_set_lock()` and `omp_unset_lock()` functions for more control (similar to standard mutexes).
Summary: OpenMP's critical directive provides a straightforward, compiler-assisted way to enforce mutual exclusion in parallel regions. It simplifies synchronization but still requires the programmer to correctly identify critical sections and avoid deadlock. It represents a structured, high-level alternative to manual Pthreads or Java-style locking.
7.5.3 Functional Programming Languages¶
Imperative vs. Functional Paradigms¶
Most mainstream languages (C, C++, Java, C#) are imperative (or procedural) languages.
- Key Trait: They are state-based. Algorithms rely on a sequence of statements that change the program's mutable state (variables, data structures).
- Concurrency Consequence: Mutable shared state is the root cause of synchronization problems—race conditions, deadlocks, and the need for locks arise directly from multiple threads modifying the same data.
Functional programming languages represent a fundamentally different paradigm.
Core Principle: Immutability and Statelessness¶
In functional languages:
- Immutable Data: Once a variable or data structure is created and assigned a value, it cannot be changed (no assignment statements after initialization).
- No Mutable State: Programs are built by composing functions that take inputs and produce outputs without modifying any external state.
- Functions as First-Class Citizens: Functions can be passed as arguments, returned from other functions, and assigned to variables.
Why This Eliminates Synchronization Problems¶
Since data is immutable:
- Race Conditions are Impossible: If data cannot be modified, multiple threads can read the same data concurrently without risk. There are no write conflicts.
- Deadlocks Cannot Occur: Locks are unnecessary because there is no mutable state to protect. No locks means no circular wait, no hold-and-wait.
- Determinism: Pure functions (those with no side effects) always produce the same output for the same input, making reasoning about concurrency much simpler.
Concurrency in functional languages typically involves:
- Implicit Parallelism: Operations on independent data can be automatically parallelized by the runtime.
- Message Passing: Threads/processes communicate by sending immutable messages (e.g., in Erlang's actor model), not by sharing memory.
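Although Java is imperative, the immutability principle above can be sketched in it: an immutable list may be shared by any number of reader threads with no locks, and an "update" builds a new structure rather than mutating the old one (illustrative class, not from the text):

```java
import java.util.ArrayList;
import java.util.List;

public class ImmutableSharing {
    // Immutable list: List.of() rejects all mutation, so concurrent
    // readers need no synchronization at all.
    static final List<Integer> PRIMES = List.of(2, 3, 5, 7, 11);

    // Functional-style "update": returns a new immutable list and
    // leaves the original untouched (no shared mutable state).
    static List<Integer> append(List<Integer> base, int value) {
        List<Integer> grown = new ArrayList<>(base);
        grown.add(value);
        return List.copyOf(grown);
    }
}
```

Because `PRIMES` can never change, a race on it is impossible by construction, which is exactly the guarantee functional languages provide globally.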
Example Languages¶
Erlang:
- Designed for highly concurrent, distributed, fault-tolerant systems (e.g., telecommunications).
- Uses an actor model: lightweight processes communicate via asynchronous message passing. Each process has its own state, but that state is private and immutable within the process.
- No shared memory between processes, eliminating traditional synchronization needs.
Scala:
- A hybrid language that blends functional and object-oriented paradigms.
- Runs on the JVM and can interoperate with Java.
- Encourages immutability (e.g., `val` for immutable variables, immutable collections) but allows mutability if needed.
Implications for Concurrent Programming¶
- Shift in Mindset: Instead of thinking about protecting shared state with locks, you design systems as collections of independent processes or transformations on immutable data streams.
- Easier Reasoning: Without side effects and mutable state, code is easier to test, debug, and parallelize.
- Performance Considerations: Immutability can lead to increased memory usage (creating new copies instead of modifying in-place). However, persistent data structures allow sharing of common parts between old and new versions, mitigating this cost.
Relation to Previous Topics¶
Functional programming represents the most radical departure from the synchronization problems discussed in this chapter. While transactional memory and OpenMP are improvements within the imperative paradigm, functional languages avoid the problem entirely by removing mutable shared state. They don't need mutexes, semaphores, or monitors for thread safety of data access.
Summary: Functional programming languages like Erlang and Scala address concurrency challenges by eliminating mutable shared state through immutability. This fundamentally removes the possibility of race conditions and deadlocks, offering a different, often simpler, approach to building concurrent systems. Their growing relevance is directly tied to the need for safer, more scalable parallel programming on multicore architectures.
7.6 Summary¶
This chapter applied the synchronization tools from Chapter 6 to practical problems and real-world systems.
Key Concepts:¶
1. Classic Synchronization Problems¶
Bounded-Buffer (Producer-Consumer) Problem:
- Models coordinated data exchange with limited buffer space.
- Requires mutual exclusion for buffer access and condition synchronization for empty/full states.
- Solved with semaphores (`mutex`, `empty`, `full`) or monitors.
Readers-Writers Problem:
- Models concurrent read access with exclusive write access.
- Has variations: First (reader-priority) and Second (writer-priority), both prone to starvation.
- Solutions track reader count; the first reader locks writers out, the last reader lets them in.
Dining-Philosophers Problem:
- Models deadlock-free allocation of multiple resources.
- A naive semaphore-per-chopstick solution leads to circular-wait deadlock.
- Remedies: limit concurrent actors, atomic acquisition, or asymmetric acquisition order.
- Monitor solution ensures a philosopher eats only if both neighbors aren't eating.
2. Real Operating System Synchronization¶
Windows:
- Uses dispatcher objects (mutexes, semaphores, events, timers) with a signaled/nonsignaled state model.
- Provides efficient critical-section objects (user-mode mutexes that spin briefly then block).
- Kernel uses interrupt masking (UP) and spinlocks (SMP).
Linux:
- Provides atomic integers, mutex locks, semaphores, spinlocks, and reader-writer variants.
- On SMP, uses spinlocks; on UP, replaces them with disabling kernel preemption.
- Tracks locks per task via `preempt_count` to prevent preemption while holding a lock.
3. POSIX Synchronization API¶
- Mutex Locks (`pthread_mutex_t`): `pthread_mutex_lock()` / `pthread_mutex_unlock()`.
- Semaphores (`sem_t`):
  - Named: `sem_open()`, global via name, usable for inter-process communication (IPC).
  - Unnamed: `sem_init()`, requires shared memory for IPC, otherwise thread-local.
  - Operations: `sem_wait()` / `sem_post()`.
- Condition Variables (`pthread_cond_t`):
  - Always used with an associated mutex.
  - `pthread_cond_wait()` atomically releases the mutex and sleeps; `pthread_cond_signal()` wakes one waiter.
  - Must wait in a `while` loop to re-check the condition.
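The wait-in-a-`while`-loop rule can be sketched as follows; the flag name `ready` and the helper `wait_until_ready` are illustrative:

```c
// Condition-variable pattern sketch: the waiter re-checks the predicate in a
// while loop, because pthread_cond_wait() may return on a spurious wakeup.
#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static int ready = 0;   // the shared condition, protected by m

static void *setter(void *arg) {
    pthread_mutex_lock(&m);
    ready = 1;                   // change the condition under the mutex
    pthread_cond_signal(&cv);    // wake one waiter
    pthread_mutex_unlock(&m);
    return NULL;
}

int wait_until_ready(void) {
    pthread_t t;
    pthread_create(&t, NULL, setter, NULL);
    pthread_mutex_lock(&m);
    while (!ready)                   // while, not if: re-check after waking
        pthread_cond_wait(&cv, &m);  // atomically releases m and sleeps
    pthread_mutex_unlock(&m);
    pthread_join(t, NULL);
    return ready;
}
```

Because `ready` is only written under the mutex, the waiter cannot miss the update even if the setter runs first: the `while` test sees the flag and skips the wait entirely.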
4. Java Synchronization¶
- Original Monitors: Every object has an intrinsic lock. `synchronized` methods/blocks, with `wait()`/`notify()` for condition synchronization.
- Java 5+ Enhancements:
  - `ReentrantLock`: more flexible than intrinsic locks (fairness, `tryLock()`, etc.). Must `unlock()` in `finally`.
  - `Semaphore`: counting semaphore (`acquire()`/`release()`). Release in `finally`.
  - `Condition`: multiple condition variables per `ReentrantLock` (`await()`/`signal()`), solving the single-condition limitation of intrinsic monitors.
5. Alternative Approaches¶
- Transactional Memory: `atomic { S }` blocks. The system ensures atomicity, with no deadlock from locks. Implemented in Software (STM) or Hardware (HTM).
- OpenMP: the `#pragma omp critical` directive provides mutual exclusion in parallel regions. Easier than manual locks, but the programmer must still identify critical sections.
- Functional Programming Languages (e.g., Erlang, Scala):
- Immutability eliminates mutable shared state.
- No race conditions or deadlocks; concurrency via message passing or data parallelism.
- Represents a paradigm shift away from state-based synchronization problems.
Overall Takeaway¶
Synchronization is a multifaceted challenge addressed at different levels: from classic theoretical problems, to OS kernel implementations, to standardized APIs (POSIX, Java), to emerging paradigms (TM, functional). The choice of tool depends on the context: the problem domain, performance needs, system constraints, and programming language. Understanding these various approaches equips you to design correct and efficient concurrent systems.
Chapter 8: Deadlocks¶
Introduction to Deadlocks¶
In a multiprogramming environment where multiple threads run concurrently, threads often need to use various system resources (like CPU cycles, files, or printers). A thread must request a resource, use it, and then release it. Problems arise when a thread requests a resource that isn't currently available—it must wait.
A deadlock occurs when a set of threads becomes permanently blocked because each thread in the set is holding a resource and waiting for another resource that is held by a different thread in the same set. It's a permanent state of waiting in a cycle. You saw this briefly in Chapter 6 as a "liveness failure"—a situation where a program fails to make progress.
A Simple Analogy (The Kansas Train Law): The book gives a perfect real-world example: an old Kansas law stated, "When two trains approach each other at a crossing, both shall come to a full stop and neither shall start up again until the other has gone." This creates a deadlock: Train A is waiting for Train B to go, and Train B is waiting for Train A to go. Neither can proceed.
Why is This Chapter Important? While operating systems provide the tools for concurrency (like locks), they typically do not automatically prevent or resolve deadlocks. It is the programmer's responsibility to design applications that avoid deadlock situations. This challenge is growing as we write more concurrent software for multicore systems.
8.1 System Model¶
What is a Resource?¶
A computer system has a finite number of resources to be shared among competing threads. Resources are categorized into types (or classes), and each type can have multiple, identical instances.
- Examples of Resource Types: CPU cycles, memory blocks, files, I/O devices (printers, network interfaces, DVD drives).
- Example of Instances: If a system has four CPUs, the resource type "CPU" has four instances. If it has two network cards, the resource type "network interface" has two instances.
- Key Point: For the purposes of resource allocation, all instances of a given type are considered identical. If a thread requests a printer, any available printer should satisfy the request. If not, then the resource types have been defined incorrectly (e.g., a color printer and a black-and-white printer should be different resource types).
Synchronization Tools as Resources¶
The synchronization tools from Chapter 6 (mutex locks and semaphores) are also system resources. In modern programming, they are the most common source of deadlocks.
- Locks as Resources: Each lock is typically associated with protecting a specific data structure (e.g., one lock for a queue, another for a linked list). Therefore, each unique lock is its own resource class with a single instance.
- Scope of Discussion: This chapter focuses on kernel-managed resources and deadlocks within a single process's threads. Deadlocks can also occur between processes using interprocess communication (IPC), but those are not managed by the kernel and are not covered here.
The Standard Resource-Use Sequence¶
A thread must interact with a resource in a strict, three-step sequence:
- Request: The thread asks for the resource. If the resource is available, it is granted immediately. If not (e.g., a mutex lock is held by another thread), the requesting thread must wait until it can acquire the resource.
- Use: The thread operates on the resource. For a mutex lock, this means entering the critical section. For a printer, this means sending data.
- Release: The thread gives up the resource so it becomes available for other threads.
How the Operating System Manages Resources¶
- System Calls: The request and release operations are often implemented as system calls (e.g., `open()`/`close()` for files, `allocate()`/`free()` for memory, `wait()`/`signal()` on semaphores, `acquire()`/`release()` on mutex locks).
- System Tables: The operating system maintains a system table to track all resources. For each resource, it records:
- Whether it is free or allocated.
- If allocated, which thread it is allocated to.
- Wait Queues: If a thread requests a resource currently allocated to another thread, the OS can place the requesting thread into a queue of threads waiting for that specific resource.
Formal Definition of a Deadlocked State¶
A set of threads is in a deadlocked state when every thread in the set is waiting for an event that can be caused only by another thread in the same set.
- The events we are concerned with are primarily resource acquisition and release.
- While we focus on logical resources (locks, semaphores, files), deadlocks can also arise from other events like data arriving on a network socket.
Classic Illustration: The Dining Philosophers (Recap from Section 7.1.3)¶
Go to Figure 7.1 (the dining philosophers diagram) for a visual.
- Resources: The chopsticks.
- Scenario: All five philosophers get hungry simultaneously. Each philosopher follows the rule: "Pick up the chopstick on your left first."
- Result: Each philosopher successfully grabs the chopstick on her left. Now, no chopsticks remain available. Every philosopher is now waiting, forever, for the chopstick on her right (which is held by the philosopher to her right) to become available. This is a perfect deadlock.
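One of the remedies recalled in the Section 7.6 summary, asymmetric acquisition, can be sketched in Pthreads. This is a minimal sketch, assuming chopsticks are mutexes; the iteration count and the helper name `run_philosophers` are illustrative:

```c
// Asymmetric dining-philosophers sketch: even-numbered philosophers pick up
// the left chopstick first, odd-numbered pick up the right first. This breaks
// the circular wait of the "everyone grabs left first" deadlock scenario.
#include <pthread.h>

#define N 5
static pthread_mutex_t chopstick[N];
static int meals[N];

static void *philosopher(void *arg) {
    int i = (int)(long)arg;
    int left = i, right = (i + 1) % N;
    int first  = (i % 2 == 0) ? left  : right;  // asymmetric order
    int second = (i % 2 == 0) ? right : left;
    for (int k = 0; k < 1000; k++) {
        pthread_mutex_lock(&chopstick[first]);
        pthread_mutex_lock(&chopstick[second]);
        meals[i]++;                              // "eat"
        pthread_mutex_unlock(&chopstick[second]);
        pthread_mutex_unlock(&chopstick[first]);
    }
    return NULL;
}

int run_philosophers(void) {
    pthread_t t[N];
    int total = 0;
    for (int i = 0; i < N; i++) {
        meals[i] = 0;
        pthread_mutex_init(&chopstick[i], NULL);
    }
    for (int i = 0; i < N; i++)
        pthread_create(&t[i], NULL, philosopher, (void *)(long)i);
    for (int i = 0; i < N; i++)
        pthread_join(t[i], NULL);
    for (int i = 0; i < N; i++)
        total += meals[i];
    return total;  // N * 1000 if every philosopher finished
}
```

With this ordering no chopstick is everyone's "second" pick, so the all-hold-one-and-wait cycle described above cannot form and the program always terminates.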
The Programmer's Responsibility¶
Developers use synchronization tools (locks) to prevent race conditions. However, in acquiring and releasing these locks, if the order is not carefully managed, it can lead to deadlock. The operating system does not automatically prevent this; it is the developer's duty to design deadlock-free programs.
8.2 Deadlock in Multithreaded Applications¶
Before we learn how to identify and manage deadlocks, let's see a concrete example of how one can occur in a real multithreaded program using POSIX mutex locks (Pthreads).
1. Mutex Lock Basics (Pthreads Refresher)¶
- `pthread_mutex_init()` initializes a mutex lock in an unlocked state.
- `pthread_mutex_lock()` is called by a thread to acquire a mutex. If the mutex is already locked by another thread, the calling thread blocks (waits) until the mutex becomes available.
- `pthread_mutex_unlock()` is called by a thread to release a mutex it holds, making it available for other threads.
2. The Deadlock Scenario Setup¶
The code creates and initializes two mutex locks:
pthread_mutex_t first_mutex;
pthread_mutex_t second_mutex;
pthread_mutex_init(&first_mutex, NULL);
pthread_mutex_init(&second_mutex, NULL);
Two threads are then created:
- `thread_one` executes the function `do_work_one()`.
- `thread_two` executes the function `do_work_two()`.

Both threads need access to both mutex locks to do their work.
3. The Problem: Opposing Lock Acquisition Order¶
Go to Figure 8.1 to see the exact code.
Thread One's Lock Order:
- `pthread_mutex_lock(&first_mutex);` // acquires first_mutex
- `pthread_mutex_lock(&second_mutex);` // then tries to acquire second_mutex

Thread Two's Lock Order (OPPOSITE):
- `pthread_mutex_lock(&second_mutex);` // acquires second_mutex
- `pthread_mutex_lock(&first_mutex);` // then tries to acquire first_mutex
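A minimal sketch in the spirit of Figure 8.1 (the bodies and comments are illustrative, not the book's exact code). Note that running both functions in concurrent threads risks a genuine deadlock, so they should only be exercised one at a time:

```c
// Two functions that acquire the same two mutexes in OPPOSITE order.
// If do_work_one and do_work_two run concurrently, each may grab its first
// mutex and then block forever on the other's: a deadlock.
#include <pthread.h>

pthread_mutex_t first_mutex  = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t second_mutex = PTHREAD_MUTEX_INITIALIZER;

void *do_work_one(void *param) {
    pthread_mutex_lock(&first_mutex);    // 1. acquire first_mutex
    pthread_mutex_lock(&second_mutex);   // 2. then second_mutex
    /* do some work */
    pthread_mutex_unlock(&second_mutex);
    pthread_mutex_unlock(&first_mutex);
    return NULL;
}

void *do_work_two(void *param) {
    pthread_mutex_lock(&second_mutex);   // 1. acquire second_mutex (opposite!)
    pthread_mutex_lock(&first_mutex);    // 2. then first_mutex
    /* do some work */
    pthread_mutex_unlock(&first_mutex);
    pthread_mutex_unlock(&second_mutex);
    return NULL;
}
```

Called sequentially from one thread, both functions complete normally, which is exactly why this bug is so hard to catch: the danger only appears under concurrent interleaving.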
4. How the Deadlock Happens (Step-by-Step)¶
This sequence leads to deadlock:
- Thread One acquires `first_mutex`.
- Thread Two acquires `second_mutex`. (These first two steps can happen concurrently.)
- Now:
  - Thread One tries to acquire `second_mutex`. It blocks because Thread Two holds it.
  - Thread Two tries to acquire `first_mutex`. It blocks because Thread One holds it.
- Result: Both threads are now permanently blocked, waiting for each other. Thread One is waiting for a resource (`second_mutex`) held by Thread Two, and Thread Two is waiting for a resource (`first_mutex`) held by Thread One. This is a deadlock.
5. The Insidious Nature of Deadlocks¶
Important Note: Deadlock may not always happen. It depends entirely on CPU scheduling timing (a race condition).
- If Thread One runs quickly enough to acquire and release both `first_mutex` and `second_mutex` before Thread Two even starts trying to acquire locks, then no deadlock occurs.
- If the threads are interleaved just right (as described in step 4), deadlock occurs.
6. Key Takeaway for Developers¶
This example highlights the core challenge with deadlocks:
Deadlocks are non-deterministic and can be incredibly difficult to identify and test for, because they may only manifest under very specific, timing-dependent scheduling scenarios. A program might run correctly 99 times out of 100 and deadlock on the 100th run, making debugging very hard.
The responsibility is on the programmer to design locking protocols that avoid the possibility of such circular waiting, which we will explore in the following sections.
8.2.1 Livelock¶
What is Livelock?¶
Livelock is another type of liveness failure, meaning the program fails to make forward progress. It is similar to deadlock in that two or more threads are prevented from proceeding, but the reason for the lack of progress is different.
- Deadlock: Threads are blocked, passively waiting for events that will never happen.
- Livelock: Threads are not blocked. They are actively executing code, but their actions continuously fail and result in no overall progress. They are "running in place."
The Hallway Analogy¶
Imagine two people trying to pass each other in a narrow hallway:
- Person A moves to their right, while Person B moves to their left. They are still facing each other.
- Seeing the obstruction, Person A moves to their left, while Person B moves to their right. They are still facing each other.
- This dance repeats endlessly. They are not stuck standing still (not blocked), but they are making no progress down the hallway. This is livelock.
Livelock in Code: The pthread_mutex_trylock() Example¶
Livelock can be illustrated by modifying the previous deadlock example using the pthread_mutex_trylock() function.
- What `pthread_mutex_trylock()` does: It attempts to acquire a mutex lock, but if the lock is unavailable, it fails immediately (returns an error code) instead of blocking. This allows a thread to do other work instead of waiting.
Go to Figure 8.2 (which rewrites the code from Figure 8.1).
How Livelock Occurs in this Scenario:¶
- Thread One acquires `first_mutex`.
- Thread Two acquires `second_mutex`.
- The Loop Begins:
  - Thread One calls `pthread_mutex_trylock(&second_mutex)`. This fails because Thread Two holds it. Thread One then releases `first_mutex`.
  - Thread Two calls `pthread_mutex_trylock(&first_mutex)`. This fails because Thread One just held it. Thread Two then releases `second_mutex`.
- Both threads loop and try again. Now they might acquire the opposite mutexes first (Thread One gets `second_mutex`, Thread Two gets `first_mutex`), and the `trylock` on the other mutex will fail again, causing them to release and restart.
- Result: The threads are actively running code, releasing and acquiring locks, but they never successfully acquire both locks at the same time to do their actual work. They are livelocked.
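The acquire/try/release pattern described in the steps above can be sketched as follows. The lock names, the helper `acquire_both`, and the bounded retry count (added so the sketch always terminates instead of livelocking forever) are illustrative:

```c
// trylock-based acquisition sketch (in the spirit of Figure 8.2).
// The thread grabs one lock, *tries* the other, and releases on failure,
// so it never holds-and-waits. Two threads retrying in lockstep can
// livelock, which is why the loop here is bounded.
#include <pthread.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

// Try to acquire both locks; give up after max_tries attempts.
// Returns 1 on success, 0 if every attempt failed (possible livelock).
int acquire_both(int max_tries) {
    for (int i = 0; i < max_tries; i++) {
        pthread_mutex_lock(&lock_a);
        if (pthread_mutex_trylock(&lock_b) == 0) {
            /* critical section: both locks held */
            pthread_mutex_unlock(&lock_b);
            pthread_mutex_unlock(&lock_a);
            return 1;
        }
        pthread_mutex_unlock(&lock_a);  // release and retry: no hold-and-wait
    }
    return 0;
}
```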
How to Avoid Livelock¶
Livelock typically happens because threads retry failing operations in perfect synchronization.
- The Solution: Break the synchronization by introducing randomness.
- Technique: When an operation (like `trylock`) fails, have the thread wait for a random duration before retrying. This makes it highly likely that one thread will succeed while the other is waiting.
- Real-World Example: Ethernet's Exponential Backoff. When two network hosts detect a packet collision, they do not retransmit immediately. Instead, each host waits for a randomly chosen backoff period before trying again. This protocol is specifically designed to avoid livelock on the network.
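A sketch of the random-backoff technique; the lock names, the helper `acquire_with_backoff`, and the 1 ms backoff bound are illustrative choices, not prescribed values:

```c
// Random-backoff sketch: after a failed trylock, sleep a random short
// interval so two competing threads fall out of lockstep.
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

static pthread_mutex_t lock_x = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_y = PTHREAD_MUTEX_INITIALIZER;

int acquire_with_backoff(void) {
    for (;;) {
        pthread_mutex_lock(&lock_x);
        if (pthread_mutex_trylock(&lock_y) == 0) {
            /* critical section: both locks held */
            pthread_mutex_unlock(&lock_y);
            pthread_mutex_unlock(&lock_x);
            return 1;
        }
        pthread_mutex_unlock(&lock_x);
        usleep(rand() % 1000);   // random backoff breaks the symmetry
    }
}
```

The randomness makes it overwhelmingly likely that one thread retries while the other is asleep, so one of them wins both locks and the pair stops "dancing."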
Summary of Livelock¶
- Less common than deadlock but remains a challenging issue in concurrent application design.
- Like deadlock, it may only occur under specific, timing-dependent scheduling circumstances, making it difficult to test for and reproduce.
- The key differentiator from deadlock: threads are in an active, endless loop of failing operations, not a passive wait state.
8.3 Deadlock Characterization¶
We have seen how deadlock happens in code. Now, we formally define the four necessary conditions that must all be present simultaneously for a deadlock to occur. Understanding these conditions is crucial because preventing deadlock involves negating at least one of them.
8.3.1 Necessary Conditions for Deadlock¶
A deadlock situation can arise if and only if all four of the following conditions hold in a system:
1. Mutual Exclusion¶
Definition: At least one resource must be held in a non-sharable mode. This means that only one thread can use the resource at a time.
- Consequence: If another thread requests that resource, it must be delayed until the current holder voluntarily releases it.
- Example: A mutex lock is the perfect example. If Thread A has locked `mutex_X`, Thread B cannot use `mutex_X` until Thread A unlocks it.
2. Hold and Wait¶
Definition: A thread must be holding at least one resource already and simultaneously be waiting to acquire one or more additional resources that are currently held by other threads.
- Consequence: Threads are not just waiting idly; they are holding resources hostage while waiting for more. This prevents other threads from using the held resources.
- Example (from Figure 8.1): Thread One holds `first_mutex` (it's holding) and is waiting for `second_mutex` (it's waiting). Thread Two holds `second_mutex` and is waiting for `first_mutex`.
3. No Preemption¶
Definition: Resources cannot be forcibly taken away (preempted) from a thread. A resource can only be released voluntarily by the thread holding it, after that thread has finished its task.
- Consequence: You cannot simply break a deadlock by grabbing a lock from one thread and giving it to another. The thread must give it up on its own.
- Contrast with CPU Preemption: The CPU itself can be preempted (taken away) by the OS scheduler, but resources like locks and I/O devices typically operate under a no-preemption policy.
4. Circular Wait¶
Definition: There must exist a closed chain (circle) of waiting threads. A set of waiting threads {T0, T1, ..., Tn} exists such that:
- `T0` is waiting for a resource held by `T1`.
- `T1` is waiting for a resource held by `T2`.
- ...
- `Tn-1` is waiting for a resource held by `Tn`.
- `Tn` is waiting for a resource held by `T0`.
- Consequence: Every thread in the circle is waiting for the next one, forming a cycle of dependency with no start or end point. No thread can proceed because each is waiting for another in the circle.
- Example (from Figure 8.1): The two-thread case: Thread One (T1) waits for a resource held by Thread Two (T2), and Thread Two (T2) waits for a resource held by Thread One (T1). This is a circular wait with two threads.
Important Notes on the Conditions¶
- All Four Are Necessary: Deadlock cannot occur if even one of these conditions is absent. Therefore, to prevent deadlock, we need to ensure at least one condition is never allowed to hold.
- Dependency Between Conditions: The Circular Wait condition actually implies the Hold and Wait condition (if there's a circle of threads waiting, each thread in that circle must be holding a resource while waiting). So, the four conditions are not completely independent.
- Why Separate Them Anyway? As we will see in Section 8.5, it is still very useful to think about them separately because each one suggests a different strategy for attacking the deadlock problem (prevention, avoidance, detection).
Visualizing Deadlock: The Resource-Allocation Graph¶
Go to Figure 8.3. This figure presents a Resource-Allocation Graph for the deadlock program in Figure 8.1.
- What it shows: A visual model with two types of nodes:
- Thread Nodes (represented as circles, e.g., `thread_one`, `thread_two`).
- Resource Nodes (represented as rectangles, e.g., `first_mutex`, `second_mutex`). Dots inside the rectangle represent instances (one dot per instance).
- How to read it:
- An assignment edge (resource → thread) means the resource is currently held by that thread.
- A request edge (thread → resource) means the thread is waiting for that resource.
- Deadlock Indicator: The graph in Figure 8.3 has a cycle (`thread_one` → `second_mutex` → `thread_two` → `first_mutex` → `thread_one`). If every resource in the cycle has only one instance (like a mutex), then the presence of a cycle in the graph means there is a deadlock.
8.3.2 Resource-Allocation Graph¶
The Resource-Allocation Graph (RAG) is a directed graph that gives us a precise, visual model to describe the state of resource allocation and requests in a system. It is a crucial tool for analyzing potential deadlocks.
Graph Structure: Vertices (V) and Edges (E)¶
The graph consists of a set of vertices V and a set of edges E.
Vertices (V): Divided into two disjoint sets:
- Thread Nodes, T = {T₁, T₂, ..., Tₙ}: Represented as circles. These are all the active threads in the system.
- Resource Type Nodes, R = {R₁, R₂, ..., Rₘ}: Represented as rectangles. These are all the different resource types in the system (e.g., Printer, Scanner, Mutex_A). Inside each rectangle, we draw a number of dots to represent the instances of that resource type (e.g., 2 dots for a system with 2 printers).
Edges (E): Directed edges (arrows) represent relationships between threads and resources.
- Request Edge: A directed edge from thread Tᵢ to resource type Rⱼ (Tᵢ → Rⱼ). This means thread Tᵢ has requested an instance of resource type Rⱼ and is currently waiting for it.
- Assignment Edge: A directed edge from resource type Rⱼ to thread Tᵢ (Rⱼ → Tᵢ). This means an instance of resource type Rⱼ has been allocated to (is currently held by) thread Tᵢ. Importantly, this edge must originate from a specific dot inside the resource rectangle to show which exact instance is assigned.
Dynamic Behavior: How the Graph Changes¶
The graph is not static; it evolves as threads act:
- On a Request: When thread Tᵢ requests an instance of resource Rⱼ, a request edge (Tᵢ → Rⱼ) is added to the graph.
- On a Grant: When the request can be fulfilled (an instance becomes available), the request edge is instantaneously transformed into an assignment edge (Rⱼ → Tᵢ) from an available dot.
- On a Release: When thread Tᵢ releases an instance of resource Rⱼ, the assignment edge (Rⱼ → Tᵢ) is deleted from the graph.
Interpreting the Graph: Cycles and Deadlock¶
The core purpose of the graph is to analyze the presence of deadlock by looking for cycles.
Case 1: Graph Contains NO Cycles¶
- Conclusion: If the resource-allocation graph contains no cycles, then no thread in the system is deadlocked. The system is in a safe state.
Case 2: Graph Contains a Cycle - Single-Instance Resources¶
- Scenario: Each resource type involved in the cycle has exactly one instance (only one dot per rectangle).
- Conclusion: A cycle is both a necessary and sufficient condition for deadlock. If you see a cycle, a deadlock has definitely occurred. Every thread in the cycle is deadlocked.
- Example: Go to Figure 8.3. This is the simple 2-thread, 2-mutex deadlock. Each mutex is a single-instance resource. The cycle `T1 → R2 → T2 → R1 → T1` means deadlock.
Case 3: Graph Contains a Cycle - Multi-Instance Resources¶
- Scenario: The cycle involves resource types that have several instances (multiple dots per rectangle).
- Conclusion: A cycle is a necessary but NOT sufficient condition for deadlock. A cycle may indicate a deadlock, but it might not. Further analysis is needed.
- Why? Because even with a cycle, one of the threads not in the cycle might release an instance, which could then be allocated to a waiting thread in the cycle, breaking the cycle.
Detailed Graph Analysis with Examples¶
Example 1: Figure 8.4 (A Snapshot, No Deadlock)
Go to Figure 8.4. Let's decode its state:
- Vertices: T = {T₁, T₂, T₃}; R = {R₁, R₂, R₃, R₄}.
- Edges: E = {T₁→R₁, T₂→R₃, R₁→T₂, R₂→T₂, R₂→T₁, R₃→T₃}.
- Resource Instances:
- R₁: 1 instance (1 dot)
- R₂: 2 instances (2 dots) → Both are allocated (to T₁ and T₂).
- R₃: 1 instance (1 dot) → Allocated to T₃.
- R₄: 3 instances (3 dots) → All are free (no edges to/from R₄).
- Thread States:
- T₁: Holds R₂, waits for R₁.
- T₂: Holds R₁ and R₂, waits for R₃.
- T₃: Holds R₃.
Analysis of Figure 8.4: There is no cycle in this graph. Therefore, the system is not deadlocked. T₃ is running (holds R₃, no waits). T₁ and T₂ are waiting, but they are not in a circular wait with each other and T₃.
Example 2: Figure 8.5 (Deadlock with a Cycle) Go to Figure 8.5. This is Figure 8.4 after we add one request: Thread T₃ requests an instance of R₂.
- New edge added: T₃ → R₂ (a request edge).
- Now cycles exist:
- T₁ → R₁ → T₂ → R₃ → T₃ → R₂ → T₁
- T₂ → R₃ → T₃ → R₂ → T₂
- Is this a deadlock? Check resources in the cycles:
- R₁ has 1 instance.
- R₃ has 1 instance.
- R₂ has 2 instances, but both are already allocated (to T₁ and T₂). There are no free instances left.
- Conclusion: Since R₁ and R₃ are single-instance and part of the cycle, and all instances of R₂ are held, this cycle cannot be broken. Threads T₁, T₂, and T₃ are deadlocked. The cycle here is sufficient to declare deadlock.
Example 3: Figure 8.6 (Cycle but NO Deadlock)
Go to Figure 8.6. This graph has a cycle: T₁ → R₁ → T₃ → R₂ → T₁.
- Check resources:
- R₁: 1 instance (held by T₃).
- R₂: Has multiple instances. One is held by T₁, one is held by T₄ (a thread outside the cycle).
- Analysis: Thread T₄, which is not in the cycle, holds an instance of R₂. If T₄ finishes and releases its R₂ instance, that freed instance can be allocated to T₁ (who is waiting for it). This would transform the request edge T₁→R₂ into an assignment edge R₂→T₁, breaking the cycle.
- Conclusion: Even though a cycle exists, the system is not necessarily deadlocked because a thread outside the cycle (T₄) holds a resource that can break the cycle. The cycle is not sufficient for deadlock here.
Summary of Resource-Allocation Graph Rules¶
- No Cycle → No Deadlock. (Safe state).
- Cycle + All involved resources are single-instance → Deadlock. (Cycle is necessary and sufficient).
- Cycle + Some involved resources have multiple instances → Deadlock may exist. (Cycle is necessary but not sufficient). You must check if existing allocations make the cycle unbreakable.
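For the single-instance case (rule 2), checking for deadlock reduces to a plain depth-first search for a cycle in the directed graph. The node encoding below (threads and resources numbered into one adjacency matrix) and the size limit are illustrative choices, not an interface from the text:

```c
// Cycle check for a resource-allocation graph where every resource type has
// a single instance. A request edge is thread -> resource; an assignment
// edge is resource -> thread. Per rule 2, a cycle then implies deadlock.
#include <string.h>

#define MAX_NODES 16

static int adj[MAX_NODES][MAX_NODES];  // adj[u][v] = 1 means edge u -> v
static int state[MAX_NODES];           // 0 = unvisited, 1 = on stack, 2 = done

static int dfs(int u, int n) {
    state[u] = 1;                      // u is on the current DFS path
    for (int v = 0; v < n; v++) {
        if (!adj[u][v]) continue;
        if (state[v] == 1) return 1;   // back edge to the path: a cycle
        if (state[v] == 0 && dfs(v, n)) return 1;
    }
    state[u] = 2;
    return 0;
}

int has_cycle(int n) {
    memset(state, 0, sizeof(state));
    for (int u = 0; u < n; u++)
        if (state[u] == 0 && dfs(u, n))
            return 1;
    return 0;
}
```

Encoding Figure 8.3 with `thread_one`=0, `thread_two`=1, `first_mutex`=2, `second_mutex`=3, the edges 2→0, 0→3, 3→1, 1→2 form the cycle the figure shows, and `has_cycle(4)` reports it.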
This distinction is vital for designing deadlock detection algorithms (Section 8.7), which must differentiate between cycles that are deadly and those that are not.
8.4 Methods for Handling Deadlocks¶
When faced with the deadlock problem, a system designer can choose from three fundamental strategies. The choice involves a trade-off between performance, convenience, and safety.
The Three General Approaches¶
1. The Ostrich Algorithm: Ignore Deadlocks Entirely¶
- Definition: Pretend that deadlocks never occur. Take no explicit action in the OS to prevent, avoid, or detect them.
- Rationale: This approach is based on a cost-benefit analysis. Implementing deadlock handling adds overhead and complexity. If deadlocks are very rare (e.g., once a month or year) and the cost of a system hang is acceptable (a manual reboot), it may be cheaper and simpler to ignore the problem.
- Consequences: If a deadlock does occur, it will be undetected. The system's performance will degrade (resources locked up, threads stalled) until it eventually stops functioning completely, requiring a manual restart.
- Who Uses This? This is the approach used by most general-purpose operating systems, including Linux and Windows. Responsibility is pushed to kernel and application developers to write programs that avoid deadlocks using proper design.
2. Ensure Deadlocks Never Happen: Prevention or Avoidance¶
This proactive approach ensures the system never enters a deadlocked state. It has two sub-categories:
Deadlock Prevention (Section 8.5): This is a static, structural approach. We design the system so that at least one of the four necessary conditions for deadlock (Mutual Exclusion, Hold and Wait, No Preemption, Circular Wait) is guaranteed never to hold. This involves imposing constraints on how resources can be requested (e.g., a thread must request all resources at once). Prevention is often restrictive and can lead to lower resource utilization and system throughput.
Deadlock Avoidance (Section 8.6): This is a dynamic, runtime approach. It requires the OS to have additional advance knowledge about each thread's future resource needs (maximum claims). For every resource request, the OS performs a safety check. It grants the request only if doing so leaves the system in a "safe state" (a state from which all threads can still possibly finish). If granting the request would lead to an "unsafe state" (where deadlock could occur later), the requesting thread must wait. Avoidance is less restrictive than prevention but requires more runtime overhead and information.
3. Allow Deadlocks, Then Deal With Them: Detection and Recovery¶
- Definition: Allow the system to enter a deadlocked state. Periodically, run a deadlock detection algorithm (Section 8.7) to examine the system state (e.g., using a resource-allocation graph) to determine if a deadlock has occurred. If a deadlock is found, employ a recovery algorithm (Section 8.8) to break it.
- Recovery Methods: May involve:
- Process/Thread Termination: Aborting one or more deadlocked threads.
- Resource Preemption: Forcibly taking resources from some threads and giving them to others (requires rollback and careful handling).
- Who Uses This? Systems where deadlocks, while undesirable, are an acceptable risk if they can be resolved automatically. This is common in database management systems, where the cost of prevention/avoidance is too high, but automatic detection and recovery are feasible.
Hybrid Approaches and Important Considerations¶
- Combining Strategies: No single approach is perfect for all resource types in an OS. A practical system may use a hybrid approach: use prevention for some resources, avoidance for others, and detection/recovery for a third class. The goal is to select the optimal cost/benefit strategy for each class.
- Deadlocks vs. Other Liveness Failures: The manual recovery methods a system already needs for other liveness failures (like a high-priority real-time thread hogging the CPU in a non-preemptive system, or livelock) can often be reused for deadlock recovery. This makes the "detect and recover" strategy more attractive, as it shares infrastructure with other necessary system functions.
- The Developer's Role: In systems using the Ostrich Algorithm (like standard Linux/Windows), the burden falls on you, the programmer. When writing multithreaded applications using synchronization tools (mutexes), you must employ techniques derived from deadlock prevention (like lock ordering) to ensure your programs are deadlock-free. The OS will not save you.
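The lock-ordering discipline mentioned above can be sketched as follows; the lock names, iteration count, and helper `run_ordered_workers` are illustrative:

```c
// Lock-ordering sketch: every code path acquires the two mutexes in the
// same global order (first_lock before second_lock), which makes the
// circular-wait condition, and hence deadlock, impossible.
#include <pthread.h>

static pthread_mutex_t first_lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t second_lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter = 0;

static void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&first_lock);    // always first_lock ...
        pthread_mutex_lock(&second_lock);   // ... then second_lock
        shared_counter++;
        pthread_mutex_unlock(&second_lock);
        pthread_mutex_unlock(&first_lock);
    }
    return NULL;
}

long run_ordered_workers(void) {
    pthread_t t1, t2;
    shared_counter = 0;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return shared_counter;  // 2 * 100000, and no interleaving can deadlock
}
```

Contrast this with the Figure 8.1 scenario: here no thread can hold `second_lock` while waiting for `first_lock`, so the wait-for cycle can never close.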
Decision Flowchart¶
In summary, the choice depends on:
- How often do deadlocks occur? (Rare → Ignore)
- What is the cost of a deadlock? (High → Use Prevention/Avoidance)
- What is the performance overhead we can tolerate? (Low overhead → Ignore; Some overhead → Detection/Recovery; Higher overhead → Prevention/Avoidance)
- Do we have knowledge of future resource needs? (Yes → Possible Avoidance; No → Prevention or Detection)
The following sections (8.5 through 8.8) will now dive into the detailed algorithms for Prevention, Avoidance, Detection, and Recovery.
8.5 Deadlock Prevention¶
Deadlock prevention is a strict, static strategy that aims to design the system in such a way that it is impossible for a deadlock to ever occur. Recall the four necessary conditions from Section 8.3.1: Mutual Exclusion, Hold and Wait, No Preemption, and Circular Wait.
The Prevention Strategy: For each condition, we devise a protocol that guarantees that condition does NOT hold in the system. If we can negate at least one condition, deadlock becomes impossible. We will now examine each condition and see how we might attack it.
8.5.1 Attacking the Mutual Exclusion Condition¶
- Goal: Make it so that resources do not require mutual exclusive access; i.e., make all resources sharable.
- Reality Check: This is generally impossible for most physical and logical resources central to deadlock problems.
- Explanation:
- Sharable Resources: Some resources are inherently sharable and thus cannot cause deadlock. A classic example is a read-only file. Multiple threads can read from it simultaneously without issue. They never need to wait for exclusive access.
- Nonsharable Resources: The resources that cause deadlock (like mutex locks, write-access to files, printers) are intrinsically nonsharable. You cannot allow two threads to write to the same file location at the same time without causing corruption. A mutex lock's entire purpose is to provide mutual exclusion for a critical section.
- Conclusion: Denying the mutual exclusion condition is not a viable general-purpose strategy for deadlock prevention because the core resources we need to manage (synchronization primitives, writeable data) fundamentally require it. We must look to the other three conditions.
8.5.2 Attacking the Hold and Wait Condition¶
Goal: Ensure that a thread can never hold one resource while waiting for another. This breaks the "hold and wait" state.
Method 1: Total Allocation (All Resources Upfront)
- Protocol: Require each thread to request and be allocated all the resources it will need for its entire execution before it is allowed to start. The thread holds all resources from the beginning.
- Problems:
- Impractical: It is often impossible for a thread to know all resources it will need in advance. Resource needs are usually dynamic (e.g., a thread may need to allocate memory based on user input).
- Severely Lowers Resource Utilization: Resources are tied up (held idle) for the entire duration of the thread's run, even if they are only needed for a short period at the beginning or end. This can lead to massive waste.
- Starvation: A thread that needs many popular resources may wait forever because it's very unlikely that all of them will be free simultaneously.
Method 2: All-or-No Resources (Release Before New Request)
- Protocol: Allow a thread to hold resources, but with a strict rule: a thread can request new resources only if it currently holds none. This means that before requesting any new resource, a thread must first release all resources it currently holds.
- How it Breaks Hold & Wait: A thread is either in a state where it holds resources (and is not waiting) or it is waiting for resources (and holds none). It can never be in the hybrid "hold and wait" state.
- Problems:
- Low Resource Utilization & Potential Starvation: Similar to Method 1. A thread needing two resources A and B must get them simultaneously. If A is free but B is not, it must release A and try again later for both, wasting the opportunity to use A.
- Practical Inefficiency: A thread that constantly needs a core resource (like a lock protecting a frequently accessed data structure) would have to repeatedly acquire and release it, losing progress and adding overhead.
Summary of Attacking Hold & Wait: While theoretically possible, protocols to negate this condition are highly restrictive, lead to poor performance and low resource utilization, and can cause starvation. They are rarely used in practice for general-purpose systems.
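To make Method 2 concrete, here is a minimal Pthreads sketch (our own illustration, not code from the book; the helper names acquire_both_or_none and release_both are assumptions). The thread may block only while it holds nothing; once it holds the first lock, it uses pthread_mutex_trylock so it never waits while holding a resource:

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t first_mutex  = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t second_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Returns true only if BOTH locks were obtained; otherwise the thread
 * ends up holding none, so it is never in a "hold and wait" state. */
bool acquire_both_or_none(void)
{
    if (pthread_mutex_lock(&first_mutex) != 0)   /* may wait, but holds nothing */
        return false;
    if (pthread_mutex_trylock(&second_mutex) == 0)
        return true;                             /* got both locks */
    pthread_mutex_unlock(&first_mutex);          /* release rather than wait */
    return false;                                /* caller retries later */
}

void release_both(void)
{
    pthread_mutex_unlock(&second_mutex);
    pthread_mutex_unlock(&first_mutex);
}
```

If acquire_both_or_none() returns false, the caller holds nothing and must simply retry, which is exactly the low-utilization and starvation cost described above.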
8.5.3 Attacking the No Preemption Condition¶
- Goal: Break the rule that resources cannot be forcibly taken away. Introduce resource preemption.
- Core Idea: If a thread cannot get what it needs immediately, we can forcibly take away (preempt) the resources it already holds. This breaks a potential deadlock by forcing a thread to release resources so others can use them.
Two main protocols implement this idea:
Protocol 1: Preempt from the Requester¶
- Scenario: Thread is holding resources and requests a new resource that is not immediately available.
- Action: All resources currently held by the requesting thread are implicitly preempted (released). These preempted resources are added to the list of resources the thread is now waiting for.
- Restart Condition: The thread is not restarted until it can be allocated both the new resource it just requested and all of its preempted old resources.
- Effect: The thread is temporarily rolled back to a state where it holds nothing, freeing its resources for others.
Protocol 2: Preempt from the Holder (Victim Selection)¶
- Scenario: Thread requests a resource.
- If available, allocate it.
- If not available, check: Is it currently held by another thread that is itself waiting for some other resource? (i.e., is the holder in a "hold and wait" state?).
- If YES, then preempt the desired resource from that waiting thread and give it to the requesting thread.
- If the resource is neither available nor held by a waiting thread, the requesting thread must simply wait.
- While Waiting: A waiting thread is vulnerable to preemption. If another thread requests a resource it holds, that resource may be preempted from it.
- Restart Condition: A thread can restart only when it is allocated all the new resources it requested and has recovered any resources that were preempted from it while it was waiting.
Practical Application and Severe Limitations¶
- Suitable Resources: This protocol can only be applied to resources whose state can be easily saved and restored without catastrophic loss of work.
- Good Examples: CPU registers (context switch saves/restores them), database transactions (they can be aborted and rolled back), memory pages (can be swapped to disk).
- Unsuitable Resources (The Big Problem): The protocol cannot be applied to the most common sources of deadlock: mutex locks and semaphores.
- Why? Preempting a lock is conceptually meaningless and dangerous. If you forcibly take a lock away from a thread that is in the middle of a critical section, you leave the protected data structure in an inconsistent, possibly corrupted state. The thread cannot be rolled back to a point before it acquired the lock without complex and expensive transaction support.
- Consequence: Since deadlocks most frequently involve synchronization primitives (locks), and preemption is not viable for them, attacking the "No Preemption" condition is often not a practical general solution for deadlock prevention in multithreaded programming.
Summary of Attacking No Preemption: While a valid theoretical approach for certain resource types (like CPU cycles or restartable transactions), it fails for the very resources (locks) that are most prone to deadlock. Therefore, like attacking Mutual Exclusion, it is of limited use for preventing the classic deadlocks programmers write in their code.
8.5.4 Attacking the Circular Wait Condition¶
Attacking Mutual Exclusion, Hold and Wait, and No Preemption has proven to be impractical for general use. However, the Circular Wait condition offers a practical and widely used strategy for deadlock prevention.
The Core Idea: Impose a Total Order¶
- Goal: Ensure that a circular chain of waiting threads can never form.
- Method:
- Define a total ordering among all resource types in the system. Assign each resource type a unique integer.
- Enforce a protocol: Threads must request resources in strictly increasing order of their assigned numbers.
Formal Protocol Definition¶
Let the set of resource types be R = {R₁, R₂, ..., Rₘ}. We define a one-to-one function F: R → N (where N is natural numbers) that assigns a unique integer to each resource type.
Protocol Rule: A thread can request resources only in an increasing order of enumeration. More formally:
- A thread can initially request any resource Rᵢ.
- After that, the thread can request a resource Rⱼ if and only if F(Rⱼ) > F(Rᵢ).
- Alternative Rule (Equivalent): A thread requesting Rⱼ must have released all resources Rᵢ such that F(Rᵢ) ≥ F(Rⱼ). (You must let go of higher/equal-numbered locks before grabbing new ones.)
- Note: If a thread needs multiple instances of the same resource type, it must request them all at once in a single request.
Applying the Protocol to the Pthread Example¶
Recall Figure 8.1 (the two-mutex deadlock). We can assign an order:
- F(first_mutex) = 1
- F(second_mutex) = 5
The Rule Fixes the Deadlock: Both thread_one and thread_two must now request locks in order first_mutex (1) then second_mutex (5). thread_two's original code (acquiring second_mutex first) is illegal under this protocol. Once we rewrite both threads to acquire in the same order (1 then 5), the circular wait is broken.
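As an illustration (assuming the mutex names from Figure 8.1; the do_work_one and do_work_two bodies are omitted), the corrected code might look like this, with both threads obeying the order F(first_mutex) = 1 before F(second_mutex) = 5:

```c
#include <pthread.h>
#include <stddef.h>

pthread_mutex_t first_mutex  = PTHREAD_MUTEX_INITIALIZER;  /* F = 1 */
pthread_mutex_t second_mutex = PTHREAD_MUTEX_INITIALIZER;  /* F = 5 */

/* Both threads now acquire locks in strictly increasing F order. */
void *thread_one(void *param)
{
    (void)param;
    pthread_mutex_lock(&first_mutex);
    pthread_mutex_lock(&second_mutex);
    /* critical section: do_work_one() */
    pthread_mutex_unlock(&second_mutex);
    pthread_mutex_unlock(&first_mutex);
    return NULL;
}

void *thread_two(void *param)
{
    (void)param;
    pthread_mutex_lock(&first_mutex);   /* originally took second_mutex first:
                                           illegal under the ordering protocol */
    pthread_mutex_lock(&second_mutex);
    /* critical section: do_work_two() */
    pthread_mutex_unlock(&second_mutex);
    pthread_mutex_unlock(&first_mutex);
    return NULL;
}
```

Because neither thread can ever hold second_mutex while waiting for first_mutex, the circular wait of Figure 8.1 cannot arise.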
Proof: Why This Prevents Circular Wait (By Contradiction)¶
Assume a circular wait does exist despite the protocol. Let the circular wait involve threads {T₀, T₁, ..., Tₙ}, where:
- Tᵢ is waiting for resource Rᵢ, which is held by Tᵢ₊₁. (For Tₙ, it waits for Rₙ, held by T₀.)
Now, analyze each thread Tᵢ₊₁ in the circle:
- It holds resource Rᵢ.
- It is requesting/waiting for resource Rᵢ₊₁.
- By the protocol rule (to hold Rᵢ while requesting Rᵢ₊₁), we must have F(Rᵢ) < F(Rᵢ₊₁).
This gives us a chain of inequalities for the entire circle:
F(R₀) < F(R₁) < F(R₂) < ... < F(Rₙ) < F(R₀).
The final term F(R₀) < F(R₀) is a contradiction (a number cannot be less than itself). Therefore, our initial assumption is false: A circular wait cannot exist under this protocol.
Practical Challenges and a Java Solution¶
- Challenge 1: Developer Discipline. The OS does not enforce this. It is up to the application programmer to define the order and write code that follows it.
- Challenge 2: Defining the Order. In a large system with hundreds of locks, defining and maintaining a global order is difficult.
- A Java Trick: Many Java developers use System.identityHashCode(Object) (which returns a hash code based on the object's identity, in practice distinct for distinct lock objects) as the ordering function F. The rule becomes: always acquire locks in order of increasing identityHashCode() of the lock objects.
The Dynamic Locking Problem: Figure 8.7¶
Go to Figure 8.7. This shows a critical limitation: lock ordering fails if locks are acquired dynamically based on runtime parameters, unless we are very careful.
- Scenario: A transaction() function transfers money from one account from to another to. Each account has its own lock, obtained via get_lock().
- Deadlock Risk:
  - Thread 1 calls transaction(checking, savings, 25.0). It gets locks in order: lock1 = checking_lock, lock2 = savings_lock.
  - Thread 2 calls transaction(savings, checking, 50.0). It gets locks in order: lock1 = savings_lock, lock2 = checking_lock.
- Result: The two threads acquire locks in opposite orders (checking → savings vs. savings → checking), creating the potential for the classic deadlock from Figure 8.1. The lock order is not fixed at compile time; it depends on the function arguments.
The Fix (not shown in the book): To apply the ordering protocol here, we must compare the two account locks before acquisition and ensure we always acquire the lock with the lower ordering (e.g., lower identityHashCode) first. This adds a check:
lock1 = get_lock(from);
lock2 = get_lock(to);
if (F(lock1) > F(lock2)) { swap(lock1, lock2); } // Ensure lock1 has lower order
acquire(lock1);
acquire(lock2);
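A self-contained C sketch of this fix (our code, not the book's; the account struct is an assumption, and the lock's address stands in for the ordering function F) could look like:

```c
#include <pthread.h>
#include <stdint.h>

struct account {
    pthread_mutex_t lock;
    double balance;
};

/* Always acquire the lower-addressed lock first, so every thread uses the
 * same global order regardless of argument order. Assumes from != to. */
void transaction(struct account *from, struct account *to, double amount)
{
    pthread_mutex_t *lock1 = &from->lock;
    pthread_mutex_t *lock2 = &to->lock;

    if ((uintptr_t)lock1 > (uintptr_t)lock2) {  /* ensure F(lock1) < F(lock2) */
        pthread_mutex_t *tmp = lock1;
        lock1 = lock2;
        lock2 = tmp;
    }
    pthread_mutex_lock(lock1);
    pthread_mutex_lock(lock2);

    from->balance -= amount;   /* withdraw(from, amount) */
    to->balance   += amount;   /* deposit(to, amount)    */

    pthread_mutex_unlock(lock2);
    pthread_mutex_unlock(lock1);
}
```

Note that this sketch assumes from and to are distinct accounts; if they could alias, an equality check would also be needed to avoid locking the same mutex twice.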
Summary of Attacking Circular Wait: This is the most practical and commonly used technique for deadlock prevention in application programming. It requires developers to impose a total order on resources (locks) and acquire them in that order. It is effective but requires discipline and careful handling of dynamic lock acquisition.
8.6 Deadlock Avoidance¶
Introduction: The Avoidance Strategy¶
Deadlock prevention algorithms are restrictive because they constrain resource requests to break one of the four necessary conditions, often leading to low device utilization and reduced system throughput.
Deadlock avoidance offers a less restrictive alternative. It allows more flexibility but requires additional advance information about thread behavior. The core idea is to be cautious: before granting any resource request, the system performs a runtime check to ensure that granting it will not potentially lead to a deadlock in the future.
- Required Information: The system needs to know the complete future sequence of resource requests and releases for each thread, or at least a summary of it.
- The Decision: For each request, the system considers:
- Resources currently available.
- Resources currently allocated to each thread.
- The future requests and releases of each thread.
- Based on this, it decides: Grant the request now (if it's safe), or make the thread wait (to avoid a potential future deadlock).
The Simplest and Most Useful Model: Maximum Claims¶
The most common model for avoidance requires each thread to declare its maximum need in advance.
- Declaration: Each thread states the maximum number of instances of each resource type that it may need during its entire execution. This is its "claim."
- Guarantee: The thread will never request more than its declared maximum.
- Goal: Using this a priori information, the system can run an algorithm that ensures the system will never enter a deadlocked state. The algorithm dynamically examines the resource-allocation state to guarantee that a circular-wait condition can never exist.
Defining the Resource-Allocation State¶
The algorithm's decision is based on a complete snapshot of the system called the resource-allocation state, which consists of:
- The number of available resources of each type.
- The number of allocated resources of each type to each thread.
- The maximum demands (claims) of each thread.
In the following sections, we will explore two specific deadlock-avoidance algorithms that use this model.
Linux Lockdep Tool (Feature Box)¶
Although ensuring proper lock order is the developer's responsibility, tools can help verify that locks are acquired correctly and detect possible deadlocks. Linux provides such a tool: lockdep.
Purpose: lockdep is a sophisticated runtime locking validator used during development and testing of the Linux kernel (and recently, user applications with Pthreads). It is not for use on production systems, as it adds significant overhead.
How it Works: It monitors all lock acquisitions and releases, building a dynamic graph of lock dependencies. It then checks this graph against a set of rules to detect potential deadlock scenarios.
Two Key Deadlock Detection Capabilities:¶
Lock Ordering Violations:
- lockdep dynamically tracks the order in which locks are acquired.
- If it detects locks being acquired in a different order than they have been in the past, it reports a possible deadlock condition. This enforces a consistent global lock order, which is the deadlock prevention technique from Section 8.5.4.
Incorrect Interrupt Handling with Spinlocks:
- In the kernel, spinlocks (a type of busy-wait lock) are often used in interrupt handlers.
- Deadlock Scenario: Kernel code acquires a spinlock. An interrupt occurs on the same CPU, preempting that kernel code. The interrupt handler tries to acquire the same spinlock, but finds it already held. It spins forever, causing a deadlock (the kernel code can't run to release the lock because it's preempted).
- The Prevention Rule: Kernel code must disable local interrupts on the current processor before acquiring a spinlock that could also be used in an interrupt handler.
- lockdep's Role: It detects when interrupts are enabled while kernel code acquires a lock that is also used in an interrupt handler, and reports this as a possible deadlock scenario.
Impact: Since its introduction in 2006, lockdep has reduced deadlock reports in the Linux kernel by an order of magnitude, proving its effectiveness as a development aid.
For More Information: See the Linux kernel documentation at https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt.
8.6.1 Safe State¶
What is a Safe State?¶
A safe state is a system state (a snapshot of current allocations, availability, and maximum claims) from which the system can guarantee that all threads can complete their work without causing a deadlock, assuming they eventually request resources up to their declared maximum.
The Formal Definition: Safe Sequence¶
A system is in a safe state if and only if there exists at least one safe sequence of threads <T₁, T₂, ..., Tₙ>.
A sequence is safe if for each thread Tᵢ in the sequence, its future resource needs can be satisfied by:
- The resources currently available in the system, plus
- The resources currently held by all threads Tⱼ that appear earlier in the sequence (j < i).
In Practical Terms: Imagine executing threads in the order of the safe sequence.
- For the first thread (T₁) in the sequence: Its remaining needs must be satisfiable immediately with currently free resources.
- Once T₁ finishes (we assume it will), it releases all its resources.
- Now, with the original free resources plus T₁'s resources, the second thread (T₂) must be able to have its needs satisfied.
- This process continues until all threads in the sequence can finish.
If no such ordering of threads exists where each can eventually be satisfied, the system is in an unsafe state.
Visualizing the Relationship: Safe, Unsafe, and Deadlock States¶
Go to Figure 8.8. This Venn diagram is crucial:
- Safe State Space: All system states where at least one safe sequence exists.
- Unsafe State Space: All system states where no safe sequence exists.
- Deadlock State Space: A subset of the unsafe states. Every deadlocked state is unsafe, but not every unsafe state is deadlocked.
Key Insight: An unsafe state is a risk zone. It may lead to deadlock if threads make unfortunate requests. A safe state is a guaranteed deadlock-free zone.
Detailed Example: A System with 12 Resources¶
Let's walk through the book's example step-by-step.
System Setup:
- Total Resources: 12 identical instances of one resource type.
- Threads & Maximum Needs:
- T₀: Maximum 10 resources.
- T₁: Maximum 4 resources.
- T₂: Maximum 9 resources.
- At time t₀ (Initial State):
- Allocated: T₀ holds 5, T₁ holds 2, T₂ holds 2.
- Available (Free) Resources: 12 - (5+2+2) = 3.
- Remaining Needs: Max - Allocated.
- T₀ Needs: 10 - 5 = 5
- T₁ Needs: 4 - 2 = 2
- T₂ Needs: 9 - 2 = 7
Step 1: Is the state at t₀ Safe? Yes. Let's find a safe sequence: < T₁, T₀, T₂ >.
- Check T₁ (Needs 2): Are 2 resources available? Yes (3 available). T₁ can finish. Assume it does, releasing its 2 held resources. New available = 3 (old) + 2 (T₁'s) = 5.
- Check T₀ (Needs 5): Are 5 resources available? Yes (5 available). T₀ can finish. It releases its 5 held resources. New available = 5 (old) + 5 (T₀'s) = 10.
- Check T₂ (Needs 7): Are 7 resources available? Yes (10 available). T₂ can finish.
Since we found a sequence where all threads can complete, the state at t₀ is SAFE.
Step 2: Transition to an Unsafe State. Now, at time t₁, Thread T₂ requests and is allocated 1 more resource.
- New Allocation: T₀=5, T₁=2, T₂=3.
- New Available: 12 - (5+2+3) = 2.
- New Remaining Needs:
- T₀ Needs: 5
- T₁ Needs: 2
- T₂ Needs: 9 - 3 = 6
Is this new state safe? Let's try to find any safe sequence.
- Try T₁ first (Needs 2): 2 available? Yes. T₁ finishes, releases 2. New available = 2 + 2 = 4.
- Now who's next?
- Try T₀ (Needs 5): 4 available? No. T₀ cannot finish next.
- Try T₂ (Needs 6): 4 available? No. T₂ cannot finish next.
- Conclusion: After T₁, neither T₀ nor T₂ can finish with the available resources. There is no safe sequence. Therefore, the state at t₁ is UNSAFE.
The Danger: In this unsafe state, if T₀ now requests its remaining 5 resources (or T₂ requests 6), they will be blocked because resources are unavailable. This could lead to a deadlock (if T₁ finishes, we have 4 free, still not enough for T₀'s 5 or T₂'s 6—both remain blocked forever). The mistake was granting T₂'s request at t₁, which moved the system from a safe to an unsafe state.
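The simulation above can be automated. Here is a small C sketch (our own helper, is_safe_single, for a single resource type only) that performs exactly this greedy search for a safe sequence:

```c
#include <stdbool.h>

/* Safety test for one resource type: repeatedly "finish" any thread whose
 * remaining need fits in the free pool, reclaiming its allocation.
 * Assumes n <= 16 for this sketch. */
bool is_safe_single(int n, const int alloc[], const int need[], int available)
{
    bool finished[16] = { false };
    int  work = available;

    for (int done = 0; done < n; ) {
        int i;
        for (i = 0; i < n; i++) {
            if (!finished[i] && need[i] <= work) {
                work += alloc[i];     /* thread i finishes, releases holdings */
                finished[i] = true;
                done++;
                break;
            }
        }
        if (i == n)                   /* no thread can finish: unsafe */
            return false;
    }
    return true;                      /* all threads finished: safe */
}
```

Running it on the t₀ state (needs 5/2/7, 3 free) reports safe; on the t₁ state (needs 5/2/6, 2 free) it reports unsafe, matching the walkthrough.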
How Avoidance Algorithms Use the Safe State Concept¶
The core idea of deadlock avoidance is: Never leave the safe state.
Algorithm Outline:
- Start: Ensure the system is initially in a safe state.
- On each resource request:
- Check if granting the request would leave the system in a safe state.
- If YES: Grant the request immediately.
- If NO: Deny the request and make the thread wait, even if the resource is currently available.
- This guarantees the system always remains in the safe region of Figure 8.8, avoiding deadlocks.
The Trade-off: Lower Resource Utilization¶
A key consequence of this cautious approach is that a thread may be forced to wait for a resource that is physically free. This is because granting it might lead to an unsafe state. This conservative policy means that resources may be idle more often (lower utilization) compared to a system without avoidance, but the trade-off is guaranteed deadlock avoidance.
8.6.2 Resource-Allocation-Graph Algorithm¶
Scope and Prerequisites¶
This is a deadlock avoidance algorithm with a specific limitation: it only works for resource types that have a single instance each (e.g., one printer, one scanner, one specific mutex lock). It is a practical application of the safe state concept using a graphical model.
Extending the Graph: Claim Edges¶
We start with the standard Resource-Allocation Graph (RAG) from Section 8.3.2 and add a new type of edge:
- Claim Edge: A dashed line from a thread Tᵢ to a resource Rⱼ (Tᵢ ─ - → Rⱼ).
- Meaning: This edge indicates that thread Tᵢ may request resource Rⱼ at some point in the future. It is a declaration of potential future need (part of the "maximum claim" required for avoidance).
Graph Transformation Rules (The Lifecycle of an Edge)¶
The graph evolves dynamically according to strict rules that reflect the thread's lifecycle with a resource:
- Initialization (A Priori Claiming): Before thread Tᵢ starts executing, all its claim edges must be added to the graph. This is the algorithm's way of knowing the thread's maximum future needs. (A relaxed rule: a claim edge can be added later, but only if Tᵢ currently has no assignment edges, meaning it holds no resources.)
- On a Request: When thread Tᵢ actually requests resource Rⱼ, the dashed claim edge (Tᵢ ─ - → Rⱼ) is converted into a solid request edge (Tᵢ ───→ Rⱼ).
- On a Grant: When the request is granted, the request edge is converted into an assignment edge (Rⱼ ───→ Tᵢ).
- On a Release: When thread Tᵢ releases resource Rⱼ, the assignment edge is converted back into a dashed claim edge (Tᵢ ─ - → Rⱼ), indicating the thread could request it again in the future.
The Core Safety Rule: No Cycles Allowed¶
The entire deadlock avoidance logic is contained in one rule applied when a thread requests a resource:
A request for resource Rⱼ by thread Tᵢ can be granted only if converting the request edge Tᵢ → Rⱼ into an assignment edge Rⱼ → Tᵢ would NOT form a cycle in the resource-allocation graph.
- How to Check: Before granting, we tentatively add the assignment edge to the graph and run a cycle-detection algorithm (complexity O(n²), where n is the number of threads).
- Result:
- If NO cycle is created: Granting the resource leaves the system in a safe state. The request is approved.
- If a cycle IS created: Granting the resource would put the system in an unsafe state. The request is denied, and the thread must wait, even if the resource is currently free.
Why Does a Cycle Mean Unsafe? Recall: with single-instance resources, a cycle in the full graph (including assignment edges) is a sufficient condition for deadlock. A cycle involving claim/request edges represents a potential future deadlock. By blocking the request that would complete the cycle, we prevent the system from ever entering that dangerous, deadlock-prone state.
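The cycle test itself is ordinary graph search. A minimal DFS sketch (our code; the node numbering and the MAXN bound are arbitrary choices) that could serve as the per-request check:

```c
#include <stdbool.h>

/* Nodes 0..n-1 represent both threads and resources; adj[u][v] means there
 * is a directed edge u -> v in the tentative resource-allocation graph. */
#define MAXN 16

static bool dfs_cycle(int n, bool adj[MAXN][MAXN], int color[], int u)
{
    color[u] = 1;                           /* gray: on the current DFS path */
    for (int v = 0; v < n; v++) {
        if (!adj[u][v])
            continue;
        if (color[v] == 1)                  /* back edge: cycle found */
            return true;
        if (color[v] == 0 && dfs_cycle(n, adj, color, v))
            return true;
    }
    color[u] = 2;                           /* black: fully explored */
    return false;
}

bool has_cycle(int n, bool adj[MAXN][MAXN])
{
    int color[MAXN] = { 0 };
    for (int u = 0; u < n; u++)
        if (color[u] == 0 && dfs_cycle(n, adj, color, u))
            return true;
    return false;
}
```

If the tentative assignment edge completes a cycle, the request is denied and the edge stays a request edge.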
Illustrated Example¶
Go to Figure 8.9. This shows an initial state. Let's assume it shows:
- Claim edges (dashed lines) indicating future possible requests.
- Some assignment edges (solid lines from resources to threads) showing currently held resources.
Scenario: Thread T₂ requests resource R₂.
- Check: Is R₂ free? Assume it is (no assignment edge from R₂).
- Safety Test: Tentatively convert T₂'s claim edge to R₂ into an assignment edge R₂ → T₂.
- Go to Figure 8.10. This shows the graph after the tentative assignment. A cycle has appeared (e.g., T₁ → ... → R₂ → T₂ → ... → R₁ → T₁).
- Decision: Because a cycle forms, granting the request would lead to an unsafe state. Therefore, the request is DENIED. T₂ must wait for R₂, even though it's free.
The Danger We Avoided: If we had granted T₂'s request (putting us in the unsafe state of Figure 8.10), a subsequent request from T₁ for R₂ and T₂ for R₁ could immediately cause a deadlock (a cycle with only assignment and request edges). The algorithm foresees this chain of events and prevents the first step.
Summary of the Algorithm¶
- Pros: Intuitive, graphical, and directly implements avoidance for single-instance resources.
- Cons:
- Limited Scope: Only works for single-instance resource types.
- Overhead: Requires O(n²) cycle detection on every resource request.
- Static Claims: Requires advance knowledge of all possible resource needs (claim edges).
- It exemplifies the cautious nature of avoidance: It sometimes says "no" to a currently available resource to maintain a long-term safe state.
8.6.3 Banker’s Algorithm¶
Introduction: A More General Algorithm¶
The Resource-Allocation-Graph algorithm only works for single-instance resources. For systems with multiple identical instances of each resource type (e.g., 3 printers, 5 tape drives), we need a more general algorithm: the Banker's Algorithm.
- Origin of the Name: The algorithm mimics a conservative banker who must ensure that, when lending money to customers, they never commit all their cash in a way that prevents them from satisfying the potential future needs of all customers (which would cause the bank to fail). Similarly, the OS must not allocate resources in a way that could lead to an unsatisfiable future request (a deadlock).
- Efficiency: It is less efficient (more computationally expensive) than the graph algorithm but handles a broader class of systems.
The Setup: Declarations and Data Structures¶
The algorithm requires advance knowledge and maintains several key data structures.
Prerequisite: When a thread/process enters the system, it must declare its maximum demand for each resource type. This declaration cannot exceed the system's total resources.
Data Structures:
Let n = number of threads, m = number of resource types.
- Available (vector of length m): Available[j] = k means there are k free instances of resource type Rⱼ right now.
- Max (n × m matrix): Max[i][j] = k means thread Tᵢ will never ask for more than k instances of resource type Rⱼ during its entire run. This is the declared maximum.
- Allocation (n × m matrix): Allocation[i][j] = k means thread Tᵢ is currently holding k instances of Rⱼ.
- Need (n × m matrix): Need[i][j] = k means thread Tᵢ may still need up to k more instances of Rⱼ to finish its job. Crucially, Need[i][j] = Max[i][j] - Allocation[i][j]. This is a dynamic value that decreases as the thread gets resources and increases if it releases some before finishing.
Vector Notation: A Key Tool¶
To compare resource allocations and needs, we use vector comparisons.
- Let X and Y be vectors of length m (each element represents a resource type).
- X ≤ Y means that for every index j, X[j] ≤ Y[j]. (Every resource count in X is less than or equal to the corresponding count in Y.)
- Example: If X = (1,7,3,2) and Y = (0,3,2,1), then Y ≤ X is true because 0≤1, 3≤7, 2≤3, and 1≤2.
- X < Y means X ≤ Y and X ≠ Y (at least one element is strictly less).
Applying to Matrices: We treat each thread's row in the Allocation and Need matrices as vectors:
- Allocationᵢ = row i of the Allocation matrix = resources currently held by Tᵢ.
- Needᵢ = row i of the Need matrix = future resources Tᵢ may still request.
The Two Parts of the Banker's Algorithm¶
The algorithm consists of two distinct routines that are run at different times:
- Safety Algorithm: Determines if the current system state is safe. This is run to check the initial state and is the core subroutine called by...
- Resource-Request Algorithm: Run whenever a thread makes a request. It simulates granting the request and then uses the Safety Algorithm to check if the resulting state would be safe. If safe, it grants the request; if unsafe, it makes the thread wait.
The next sections will detail these algorithms.
8.6.3.1 Safety Algorithm¶
This algorithm answers the question: "Is the current system state safe?" It systematically searches for a safe sequence among the threads.
Algorithm Steps in Detail¶
We maintain two working vectors:
- Work (length m): Represents the projected available resources as we simulate threads finishing. Initialized to the current Available vector.
- Finish (length n): A boolean array where Finish[i] = true means we have simulated thread Tᵢ finishing its work. Initialized to false for all threads.
Step 1: Initialization.
- Work = Available (start with what's currently free).
- Finish[i] = false for all i from 0 to n-1 (no thread has finished yet in our simulation).
Step 2: Search for a Thread that Can Finish.
Find an index i such that:
- Condition (a): Finish[i] == false (the thread hasn't been simulated to finish yet).
- Condition (b): Needᵢ ≤ Work (the thread's remaining needs are less than or equal to the projected available resources). This means, in our simulation, this thread's future requests could be satisfied immediately.
If no such thread i exists, go to Step 4.
Step 3: Simulate the Thread Finishing.
Assume thread Tᵢ runs to completion. Update our simulation state:
- Work = Work + Allocationᵢ (the thread releases all resources it currently holds; add them to the projected available pool).
- Finish[i] = true (mark this thread as having finished in our simulation).
- Go back to Step 2 to look for the next thread that can finish with the now-larger Work vector.
Step 4: Check the Result.
After the loop ends (when no unfinished thread can satisfy its Need from Work), check the Finish array.
- If Finish[i] == true for all i, then all threads were able to finish in our simulation. Therefore, a safe sequence exists, and the system is in a SAFE state.
- If some Finish[i] == false, then those threads could not be guaranteed to finish. No safe sequence exists, and the system is in an UNSAFE state.
Complexity: This algorithm requires O(m × n²) operations in the worst case, as we might scan the list of n threads up to n times (once for each thread we add to the sequence), and each scan involves comparing an m-length vector (Needᵢ ≤ Work).
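The four steps translate almost line for line into code. A sketch (our code, not the book's; n and m are fixed as N and M here for brevity):

```c
#include <stdbool.h>

#define N 5   /* number of threads */
#define M 3   /* number of resource types */

/* Vector comparison: a <= b element-wise. */
static bool vec_leq(const int *a, const int *b)
{
    for (int j = 0; j < M; j++)
        if (a[j] > b[j])
            return false;
    return true;
}

bool is_safe(const int avail[M], int alloc[N][M], int need[N][M])
{
    int  work[M];                         /* Step 1: Work = Available   */
    bool finish[N] = { false };           /*         Finish[i] = false  */
    for (int j = 0; j < M; j++)
        work[j] = avail[j];

    for (bool progressed = true; progressed; ) {
        progressed = false;
        for (int i = 0; i < N; i++) {     /* Step 2: find i with Needᵢ ≤ Work */
            if (!finish[i] && vec_leq(need[i], work)) {
                for (int j = 0; j < M; j++)
                    work[j] += alloc[i][j];   /* Step 3: Work += Allocationᵢ */
                finish[i] = true;
                progressed = true;
            }
        }
    }
    for (int i = 0; i < N; i++)           /* Step 4: did everyone finish? */
        if (!finish[i])
            return false;                 /* UNSAFE */
    return true;                          /* SAFE   */
}
```

Feeding it the snapshot from Section 8.6.3.3 (Available = (3,3,2)) reports a safe state, since the simulation finds the sequence the example derives by hand.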
8.6.3.2 Resource-Request Algorithm¶
This algorithm is executed whenever a thread Tᵢ makes a new request for resources. It decides whether to grant the request immediately or make the thread wait. It uses the Safety Algorithm as a subroutine.
Let Requestᵢ be the request vector for thread Tᵢ. Requestᵢ[j] = k means Tᵢ wants k instances of resource type Rⱼ.
Algorithm Steps in Detail¶
Step 1: Validate the Request Against the Thread's Claim.
- Check: Is Requestᵢ ≤ Needᵢ? (Is the request within the thread's declared remaining need?)
- If NO: Raise an error. The thread is asking for more than it declared it would ever need (violates the Max claim).
- If YES: Proceed to Step 2.
Step 2: Check for Immediate Availability.
- Check: Is Requestᵢ ≤ Available? (Are there enough free instances of each resource type to satisfy this request right now?)
- If NO: The resources are not available. Thread Tᵢ must wait.
- If YES: Proceed to Step 3.
Step 3: Tentatively Allocate and Test for Safety. This is the core avoidance step. We pretend to grant the request and see if the resulting state is safe.
Tentatively modify the state as if the request was granted:
- Available = Available – Requestᵢ (deduct the requested resources from the free pool).
- Allocationᵢ = Allocationᵢ + Requestᵢ (add the resources to the thread's allocation).
- Needᵢ = Needᵢ – Requestᵢ (reduce the thread's remaining need).
Run the Safety Algorithm on this new, tentative state.
Make the Decision:
- If the resulting state is SAFE: The request is approved. The tentative changes are made permanent. The transaction is completed, and the thread gets its resources.
- If the resulting state is UNSAFE: The request is denied. Thread Tᵢ must wait. Crucially, we roll back the tentative changes, restoring Available, Allocationᵢ, and Needᵢ to their values from before Step 3. The system state remains unchanged.
Key Points and Implications¶
- Wait Even When Resources are Free: A thread can be made to wait at Step 2 (resources not free) or at Step 3 (resources are free, but granting them would lead to an unsafe state). This is the essence of avoidance's conservatism.
- Ensures Continuous Safety: By only granting requests that lead to safe states, the algorithm guarantees the system always remains in the safe region of the state space (Figure 8.8).
- Overhead: The need to run the O(m × n²) Safety Algorithm on every resource request makes the Banker's Algorithm too expensive for general use in most OS kernels, but it is a foundational concept and can be used in constrained environments like database systems.
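Putting the two routines together, a sketch of the Resource-Request Algorithm (our code; the 0/1/-1 return convention is our own, and N and M are fixed for brevity) might be:

```c
#include <stdbool.h>

#define N 5   /* number of threads */
#define M 3   /* number of resource types */

static bool vec_leq(const int *a, const int *b)
{
    for (int j = 0; j < M; j++)
        if (a[j] > b[j])
            return false;
    return true;
}

/* Safety Algorithm used as a subroutine (reads the state, never writes it). */
static bool is_safe(int avail[M], int alloc[N][M], int need[N][M])
{
    int  work[M];
    bool finish[N] = { false };
    for (int j = 0; j < M; j++)
        work[j] = avail[j];
    for (bool progressed = true; progressed; ) {
        progressed = false;
        for (int i = 0; i < N; i++)
            if (!finish[i] && vec_leq(need[i], work)) {
                for (int j = 0; j < M; j++)
                    work[j] += alloc[i][j];
                finish[i] = true;
                progressed = true;
            }
    }
    for (int i = 0; i < N; i++)
        if (!finish[i])
            return false;
    return true;
}

/* Returns 0 = granted, 1 = thread must wait, -1 = error (exceeds claim). */
int request_resources(int i, const int req[M],
                      int avail[M], int alloc[N][M], int need[N][M])
{
    if (!vec_leq(req, need[i]))
        return -1;                     /* Step 1: exceeds declared Max claim */
    if (!vec_leq(req, avail))
        return 1;                      /* Step 2: not currently available    */
    for (int j = 0; j < M; j++) {      /* Step 3: tentative allocation       */
        avail[j]    -= req[j];
        alloc[i][j] += req[j];
        need[i][j]  -= req[j];
    }
    if (is_safe(avail, alloc, need))
        return 0;                      /* safe: commit the grant             */
    for (int j = 0; j < M; j++) {      /* unsafe: roll back, thread waits    */
        avail[j]    += req[j];
        alloc[i][j] -= req[j];
        need[i][j]  += req[j];
    }
    return 1;
}
```

Note that both a Step 2 failure and a Step 3 (unsafe) failure return "wait": in the second case the resources are physically free, which is the conservatism discussed above.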
8.6.3.3 An Illustrative Example¶
This is a complete, numerical walkthrough of the Banker's Algorithm.
System Setup¶
- Threads: T₀, T₁, T₂, T₃, T₄ (n=5)
- Resource Types: A, B, C (m=3)
- Total Instances: A=10, B=5, C=7
- Current State (Snapshot):
| Thread | Allocation (A,B,C) | Max (A,B,C) | Need (A,B,C) = Max - Allocation |
|---|---|---|---|
| T₀ | (0, 1, 0) | (7, 5, 3) | (7, 4, 3) |
| T₁ | (2, 0, 0) | (3, 2, 2) | (1, 2, 2) |
| T₂ | (3, 0, 2) | (9, 0, 2) | (6, 0, 0) |
| T₃ | (2, 1, 1) | (2, 2, 2) | (0, 1, 1) |
| T₄ | (0, 0, 2) | (4, 3, 3) | (4, 3, 1) |
- Available Vector (Current Free Resources): (3, 3, 2).
- Check: Sum of Allocations = (7, 2, 5). Total Resources = (10, 5, 7). So Available = (10-7, 5-2, 7-5) = (3, 3, 2). Correct.
Step 1: Is the Initial System State Safe?¶
We claim it is safe. Let's verify by finding a safe sequence: < T₁, T₃, T₄, T₂, T₀ >.
We simulate using the Safety Algorithm:
- Initial: Work = Available = (3,3,2). Finish = [false, false, false, false, false].

Iteration 1: Find a thread where Needᵢ ≤ Work.
- Check T₁: Need₁ = (1,2,2). Is (1,2,2) ≤ (3,3,2)? Yes (1≤3, 2≤3, 2≤2). So T₁ can finish.
- Simulate T₁ finishing: Work = Work + Allocation₁ = (3,3,2) + (2,0,0) = (5,3,2). Mark Finish[1] = true.

Iteration 2: Work = (5,3,2). Find an unfinished thread with Need ≤ Work.
- Check T₃: Need₃ = (0,1,1) ≤ (5,3,2)? Yes. T₃ finishes. Work = (5,3,2) + (2,1,1) = (7,4,3). Finish[3] = true.

Iteration 3: Work = (7,4,3).
- Check T₄: Need₄ = (4,3,1) ≤ (7,4,3)? Yes. T₄ finishes. Work = (7,4,3) + (0,0,2) = (7,4,5). Finish[4] = true.

Iteration 4: Work = (7,4,5).
- Check T₂: Need₂ = (6,0,0) ≤ (7,4,5)? Yes. T₂ finishes. Work = (7,4,5) + (3,0,2) = (10,4,7). Finish[2] = true.

Iteration 5: Work = (10,4,7).
- Check T₀: Need₀ = (7,4,3) ≤ (10,4,7)? Yes. T₀ finishes. Work = (10,4,7) + (0,1,0) = (10,5,7). Finish[0] = true.
Final Check: All Finish[i] are true. The state is SAFE. The safe sequence is <T₁, T₃, T₄, T₂, T₀>.
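The simulation above can be captured in a short Python sketch of the Safety Algorithm (a minimal illustration; names such as `is_safe` are ours, not the book's):

```python
def is_safe(available, allocation, need):
    """Banker's Safety Algorithm: return (is the state safe?, safe sequence found)."""
    n, m = len(allocation), len(available)
    work = list(available)          # resources free in the simulation
    finish = [False] * n
    sequence = []
    progressed = True
    while progressed:
        progressed = False
        for i in range(n):
            # Pick an unfinished thread whose remaining Need fits in Work.
            if not finish[i] and all(need[i][j] <= work[j] for j in range(m)):
                # Simulate it finishing: reclaim everything it holds.
                for j in range(m):
                    work[j] += allocation[i][j]
                finish[i] = True
                sequence.append(i)
                progressed = True
    return all(finish), sequence

# The snapshot from this section:
allocation = [(0,1,0), (2,0,0), (3,0,2), (2,1,1), (0,0,2)]
need       = [(7,4,3), (1,2,2), (6,0,0), (0,1,1), (4,3,1)]
print(is_safe([3, 3, 2], allocation, need))   # safe; finds the order T1, T3, T4, T0, T2
```

Note that this scan finds <T₁, T₃, T₄, T₀, T₂> rather than the text's <T₁, T₃, T₄, T₂, T₀>; a safe state may admit several safe sequences, and any one of them proves safety.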
Step 2: Handling a Resource Request¶
Now, Thread T₁ makes a request: Request₁ = (1, 0, 2).
We follow the Resource-Request Algorithm:
Step 1: Validate against Need.
Request₁ = (1,0,2). Need₁ = (1,2,2). Is (1,0,2) ≤ (1,2,2)? Yes (1≤1, 0≤2, 2≤2). Request is within declared needs.
Step 2: Check against Available.
Available = (3,3,2). Is (1,0,2) ≤ (3,3,2)? Yes (1≤3, 0≤3, 2≤2). Resources are free.
Step 3: Tentative Allocation and Safety Check.
- Pretend to allocate:
  - Available = (3,3,2) - (1,0,2) = (2,3,0)
  - Allocation₁ = (2,0,0) + (1,0,2) = (3,0,2)
  - Need₁ = (1,2,2) - (1,0,2) = (0,2,0)
- New Tentative State (Available = (2,3,0)):

| Thread | Allocation | Need |
|---|---|---|
| T₀ | (0,1,0) | (7,4,3) |
| T₁ | (3,0,2) | (0,2,0) |
| T₂ | (3,0,2) | (6,0,0) |
| T₃ | (2,1,1) | (0,1,1) |
| T₄ | (0,0,2) | (4,3,1) |
Run Safety Algorithm on this new state to see if it's safe.
- Work = (2,3,0), Finish all false.
- Find a thread with Need ≤ Work. T₁'s new Need₁ = (0,2,0) ≤ (2,3,0)? Yes. Simulate T₁ finishing: Work = (2,3,0) + (3,0,2) = (5,3,2). Finish[1] = true.
- Work = (5,3,2). T₃: Need₃ = (0,1,1) ≤ (5,3,2)? Yes. Work = (5,3,2) + (2,1,1) = (7,4,3). Finish[3] = true.
- Work = (7,4,3). T₄: Need₄ = (4,3,1) ≤ (7,4,3)? Yes. Work = (7,4,3) + (0,0,2) = (7,4,5). Finish[4] = true.
- Work = (7,4,5). T₀: Need₀ = (7,4,3) ≤ (7,4,5)? Yes. Work = (7,4,5) + (0,1,0) = (7,5,5). Finish[0] = true.
- Work = (7,5,5). T₂: Need₂ = (6,0,0) ≤ (7,5,5)? Yes. Work = (7,5,5) + (3,0,2) = (10,5,7). Finish[2] = true.

All threads finished. The new state is SAFE (the sequence <T₁, T₃, T₄, T₀, T₂> works).
Conclusion: Since the tentative state is safe, the request is GRANTED. The new system state becomes the one we calculated above.
Important Observations from the Example¶
A Request That Cannot Be Granted (Insufficient Resources):
- If T₄ now requests (3,3,0), Step 2 of the algorithm fails: Request₄ = (3,3,0) is NOT ≤ Available = (2,3,0) (because 3 ≤ 2 is false for resource A). T₄ must wait because the resources simply aren't free.

A Request That Cannot Be Granted (Leads to Unsafe State):
- In the new state (after T₁'s request was granted), suppose T₀ requests (0,2,0).
- Step 1: Need₀ = (7,4,3), so (0,2,0) ≤ (7,4,3) is true.
- Step 2: Available = (2,3,0), so (0,2,0) ≤ (2,3,0) is true.
- Step 3: Tentative allocation would give Available = (2,1,0). Running the Safety Algorithm on this state fails: no thread's remaining Need fits in (2,1,0) (even T₃'s Need₃ = (0,1,1) requires an instance of C, and none are free). The state is unsafe.
- Therefore, the request is DENIED, even though the resources (0,2,0) are currently available! This is the essence of deadlock avoidance.
Programming Exercise: Implementing this algorithm is an excellent way to understand its mechanics fully.
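As a starting point for that exercise, here is one possible sketch of the Resource-Request Algorithm in Python (function names and the in-place state representation are our own choices, not the book's):

```python
def banker_safe(available, allocation, need):
    """Safety check: can all threads finish in some order?"""
    work, finish = list(available), [False] * len(allocation)
    progressed = True
    while progressed:
        progressed = False
        for i, (alloc, nd) in enumerate(zip(allocation, need)):
            if not finish[i] and all(x <= w for x, w in zip(nd, work)):
                work = [w + a for w, a in zip(work, alloc)]   # reclaim its resources
                finish[i] = progressed = True
    return all(finish)

def request_resources(i, request, available, allocation, need):
    """Resource-Request Algorithm for thread i.
    Returns True and commits the state if granted; False (state untouched) if the
    thread must wait."""
    m = len(available)
    # Step 1: the request must not exceed the thread's declared maximum claim.
    if any(request[j] > need[i][j] for j in range(m)):
        raise ValueError("thread exceeded its maximum claim")
    # Step 2: the resources must be free right now.
    if any(request[j] > available[j] for j in range(m)):
        return False                      # must wait: not available
    # Step 3: tentatively allocate, then test for safety.
    for j in range(m):
        available[j] -= request[j]
        allocation[i][j] += request[j]
        need[i][j] -= request[j]
    if banker_safe(available, allocation, need):
        return True                       # safe: commit the tentative changes
    for j in range(m):                    # unsafe: roll back
        available[j] += request[j]
        allocation[i][j] -= request[j]
        need[i][j] += request[j]
    return False

# Replaying the chapter's example:
available  = [3, 3, 2]
allocation = [[0,1,0], [2,0,0], [3,0,2], [2,1,1], [0,0,2]]
need       = [[7,4,3], [1,2,2], [6,0,0], [0,1,1], [4,3,1]]
print(request_resources(1, [1, 0, 2], available, allocation, need))  # True: granted
print(request_resources(0, [0, 2, 0], available, allocation, need))  # False: unsafe, denied
```

The second call illustrates the key behavior: the request fits in Available, yet it is denied because the tentative state fails the safety check, and the rollback leaves Available at (2,3,0).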
8.7 Deadlock Detection¶
Introduction: The Detection and Recovery Strategy¶
This is the third major strategy for handling deadlocks (after ignoring the problem, and prevention/avoidance). It involves two distinct phases:
- Detection Algorithm: Periodically examine the system state to determine if a deadlock has already occurred.
- Recovery Algorithm: If a deadlock is detected, take action to break it (e.g., abort threads, preempt resources).
Important Trade-off: This scheme introduces overhead:
- Runtime Costs: Maintaining state information and periodically executing the detection algorithm consumes CPU cycles.
- Recovery Losses: Recovering from a deadlock often involves rolling back or terminating threads, which means losing work.
We will examine detection algorithms for two scenarios: (1) systems where each resource has a single instance, and (2) systems with multiple instances of each resource type.
8.7.1 Single Instance of Each Resource Type¶
For systems where each resource type has exactly one instance (e.g., one scanner, one specific lock), we can use a simplified version of the resource-allocation graph called the Wait-For Graph.
Constructing the Wait-For Graph¶
Go to Figure 8.11. This shows the transformation.
- (a) Resource-Allocation Graph (RAG): The standard graph with thread nodes (circles) and resource nodes (rectangles), connected by request and assignment edges.
- Transformation Rule: To create the Wait-For Graph:
- Remove all resource nodes (rectangles).
- Collapse the edges. For any resource node R_q that has an incoming request edge from thread T_i and an outgoing assignment edge to thread T_j, create a directed edge T_i → T_j in the wait-for graph.
- Interpretation of an edge T_i → T_j: Thread T_i is waiting for thread T_j to release a specific resource that T_i needs.
Example from Figure 8.11:
- In the RAG (a), thread T₁ has a request edge to resource R₁, and R₁ has an assignment edge to T₂. This collapses to a wait-for edge T₁ → T₂ in graph (b), meaning "T₁ waits for T₂".
- Similarly, T₂ → R₂ and R₂ → T₄ become T₂ → T₄; T₄ → R₅ and R₅ → T₅ become T₄ → T₅; T₅ → R₃ and R₃ → T₁ become T₅ → T₁.
The Deadlock Detection Rule¶
For systems with single-instance resources, the rule is simple and definitive:
A deadlock exists in the system if and only if the wait-for graph contains a directed cycle.
- Why? A cycle in the wait-for graph, e.g., T₁ → T₂ → T₄ → T₅ → T₁, directly represents a circular wait: T₁ waits for T₂, who waits for T₄, who waits for T₅, who waits for T₁. With single instances, this circular wait is a deadlock.
- Detection Algorithm: The system must periodically (or on suspicion of deadlock) run a cycle-detection algorithm on the wait-for graph. The complexity is O(n²), where n is the number of threads (vertices).
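The cycle test itself is ordinary directed-graph cycle detection. A minimal sketch (the adjacency-list representation and thread names are illustrative):

```python
def has_cycle(wait_for):
    """Detect a cycle in a wait-for graph given as {thread: [threads it waits for]}."""
    WHITE, GRAY, BLACK = 0, 1, 2        # unvisited / on the current DFS path / done
    color = {t: WHITE for t in wait_for}

    def dfs(t):
        color[t] = GRAY
        for u in wait_for.get(t, []):
            if color.get(u, WHITE) == GRAY:     # back edge: a cycle exists
                return True
            if color.get(u, WHITE) == WHITE and dfs(u):
                return True
        color[t] = BLACK
        return False

    return any(color[t] == WHITE and dfs(t) for t in wait_for)

# Edges from the Figure 8.11 discussion: T1→T2, T2→T4, T4→T5, T5→T1.
graph = {"T1": ["T2"], "T2": ["T4"], "T4": ["T5"], "T5": ["T1"]}
print(has_cycle(graph))    # True: these threads form a circular wait
```

Depth-first search with a "currently on the recursion stack" marking (GRAY) detects a cycle exactly when it sees a back edge, in time linear in the graph size per traversal.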
Practical Tool: The BCC deadlock_detector¶
The BPF Compiler Collection (BCC) toolkit for Linux (introduced in Section 2.10.4) includes a practical tool for deadlock detection in user-space Pthreads programs.
- How it Works: It uses dynamic tracing to insert probes (hooks) into the pthread_mutex_lock() and pthread_mutex_unlock() function calls in a target process.
- Action: Whenever the traced process calls these functions, the deadlock_detector tool constructs a real-time wait-for graph of the mutex locks in that process.
- Detection: It analyzes this graph and reports a potential deadlock if it detects a cycle.
- Purpose: This is a debugging and development tool, not a production runtime deadlock solver. It helps programmers identify code paths that can lead to deadlock.
Summary for Single-Instance Resources: Detection is conceptually straightforward—maintain a wait-for graph and look for cycles. The challenge lies in the overhead of maintaining the graph and running the cycle detection algorithm.
8.7.2 Several Instances of a Resource Type¶
The Wait-For Graph method doesn't work when resources have multiple instances. For these systems, we need a detection algorithm that closely resembles the Banker's Algorithm, but with a crucial difference in purpose: the Banker's Algorithm prevents unsafe states, while this algorithm detects actual deadlocks after they may have occurred.
Data Structures¶
We maintain three key data structures, similar to the Banker's algorithm:
- Available: Vector of length m. Available[j] = k means k instances of resource type Rⱼ are currently free.
- Allocation: n × m matrix. Allocation[i][j] = k means thread Tᵢ currently holds k instances of Rⱼ.
- Request: n × m matrix. Request[i][j] = k means thread Tᵢ is currently blocked, waiting for k more instances of Rⱼ. (This is different from the Banker's Need matrix, which was a future maximum. Request is the current, outstanding request that caused the thread to wait.)
We treat rows as vectors: Allocationᵢ, Requestᵢ. The ≤ relation for vectors is defined as before.
The Deadlock Detection Algorithm¶
This algorithm simulates a selective reduction of the system. It tries to find an order in which blocked threads could eventually complete, assuming they get the resources they're waiting for. Threads that are not blocked are assumed to finish quickly and release their resources.
Step 1: Initialization.
- Work = Available (start with the currently free resources).
- For each thread Tᵢ:
  - If Allocationᵢ ≠ 0 (the thread holds some resources), set Finish[i] = false. (We need to check whether it can finish.)
  - If Allocationᵢ = 0 (the thread holds no resources), set Finish[i] = true. (A thread holding nothing cannot be part of a resource deadlock: it may be runnable or waiting, but it holds no resources that other threads need.)
Step 2: Find a Thread That Could Be Satisfied.
Find an index i such that:
- (a) Finish[i] == false (the thread is still under consideration).
- (b) Requestᵢ ≤ Work (the thread's current waiting request can be satisfied by the resources currently available in our simulation).
If no such i exists, go to Step 4.
Step 3: Simulate Thread Completion.
Assume Tᵢ gets its requested resources, finishes its task, and releases all its held resources.
- Work = Work + Allocationᵢ (add Tᵢ's held resources to the free pool).
- Finish[i] = true.
- Go back to Step 2.
Step 4: Analyze the Results.
After the loop ends (no more threads satisfy condition 2b), examine the Finish array.
- If Finish[i] == false for some i, then thread Tᵢ is deadlocked. The set of all threads with Finish[i] == false is the deadlocked set.
- If Finish[i] == true for all i, then no deadlock exists. The sequence of threads marked true in Step 3 represents an order in which all threads could have completed, so the system state was not deadlocked.
Complexity: O(m × n²) operations, same as the Safety Algorithm.
Why the Algorithm Works: The "Optimistic Assumption"¶
A key point in Step 3: We immediately reclaim a thread's resources once we see its Requestᵢ ≤ Work. Why?
- Reasoning: If a thread's current request can be satisfied (Requestᵢ ≤ Work), it is not currently deadlocked in our simulation snapshot. We make the optimistic assumption that this thread will proceed, finish quickly, and free its resources. This allows our simulation to "unlock" more resources for other threads.
- If the assumption is wrong (e.g., the thread immediately requests more resources after the ones we just gave it), that new request could cause a deadlock later. Our algorithm is a snapshot detection; that later deadlock will be caught the next time the detection algorithm is run.
Illustrated Example¶
System: 5 threads (T₀..T₄), 3 resource types (A, B, C). Total: A=7, B=2, C=6.
Snapshot 1: (No Deadlock) Given State:
- Allocation Matrix:
- T₀: (0,1,0)
- T₁: (2,0,0)
- T₂: (3,0,3)
- T₃: (2,1,1)
- T₄: (0,0,2)
- Request Matrix (Current Outstanding Requests):
- T₀: (0,0,0) → Not blocked, requesting nothing.
- T₁: (2,0,2) → Blocked, waiting for 2 A and 2 C.
- T₂: (0,0,0) → Not blocked.
- T₃: (1,0,0) → Blocked, waiting for 1 A.
- T₄: (0,0,2) → Blocked, waiting for 2 C.
- Available Vector: (0,0,0) (all resources are allocated).
Run the Detection Algorithm on Snapshot 1:
- Work = (0,0,0). Finish = [false, false, false, false, false] (all threads hold resources, so all start false).
- Find i where Finish[i] == false and Requestᵢ ≤ Work.
- T₀: Request₀ = (0,0,0) ≤ (0,0,0)? Yes. (We find T₀ first; T₂, which also has Request = (0,0,0), would work too.)
- Simulate T₀ finishing: Work = (0,0,0) + Allocation₀ = (0,0,0) + (0,1,0) = (0,1,0). Finish[0] = true.
- Work = (0,1,0). T₂: Request₂ = (0,0,0) ≤ (0,1,0)? Yes. Simulate T₂: Work = (0,1,0) + (3,0,3) = (3,1,3). Finish[2] = true.
- Work = (3,1,3). T₃: Request₃ = (1,0,0) ≤ (3,1,3)? Yes. Simulate T₃: Work = (3,1,3) + (2,1,1) = (5,2,4). Finish[3] = true.
- Work = (5,2,4). T₁: Request₁ = (2,0,2) ≤ (5,2,4)? Yes. Simulate T₁: Work = (5,2,4) + (2,0,0) = (7,2,4). Finish[1] = true.
- Work = (7,2,4). T₄: Request₄ = (0,0,2) ≤ (7,2,4)? Yes. Simulate T₄: Work = (7,2,4) + (0,0,2) = (7,2,6). Finish[4] = true.
- Step 4: All Finish[i] are true. No deadlock. The sequence <T₀, T₂, T₃, T₁, T₄> is a feasible completion order.
Snapshot 2: (Deadlock Introduced) Now, Thread T₂ makes an additional request for 1 instance of C.
- New Request Matrix: T₂'s row changes from (0,0,0) to (0,0,1).
- Now T₂ is blocked, waiting for 1 instance of C.
Run the Detection Algorithm on Snapshot 2:
- Work = (0,0,0). Finish all false.
- Find i where Finish[i] == false and Requestᵢ ≤ Work.
- Check T₀: (0,0,0) ≤ (0,0,0)? Yes. Simulate T₀ finishing: Work = (0,1,0). Finish[0] = true.
- Now Work = (0,1,0). Find the next i:
  - Check T₁: (2,0,2) ≤ (0,1,0)? No.
  - Check T₂: (0,0,1) ≤ (0,1,0)? No (needs 1 C, but 0 C available).
  - Check T₃: (1,0,0) ≤ (0,1,0)? No (needs 1 A, but 0 A available).
  - Check T₄: (0,0,2) ≤ (0,1,0)? No.
- No thread satisfies condition 2b. Go to Step 4.
- Finish is false for T₁, T₂, T₃, T₄. The system is deadlocked. Threads T₁, T₂, T₃, and T₄ form the deadlocked set. (T₀ is not deadlocked; it could have run.)
Conclusion: The algorithm successfully identified the deadlock in the second scenario by simulating resource reclamation and finding that a set of threads could not make progress.
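Both snapshot runs can be reproduced with a direct sketch of the detection algorithm (a minimal Python illustration; the function name and data layout are our own):

```python
def detect_deadlock(available, allocation, request):
    """Deadlock detection: return the set of deadlocked thread indices (empty if none)."""
    n = len(allocation)
    work = list(available)
    # Threads holding no resources cannot be part of a resource deadlock.
    finish = [all(a == 0 for a in alloc) for alloc in allocation]
    progressed = True
    while progressed:
        progressed = False
        for i in range(n):
            # Optimistic assumption: a thread whose outstanding request fits in
            # Work will finish and release everything it currently holds.
            if not finish[i] and all(r <= w for r, w in zip(request[i], work)):
                work = [w + a for w, a in zip(work, allocation[i])]
                finish[i] = progressed = True
    return {i for i in range(n) if not finish[i]}

# The two snapshots from this section (Request matrices differ in T2's row):
allocation = [[0,1,0], [2,0,0], [3,0,3], [2,1,1], [0,0,2]]
snapshot1  = [[0,0,0], [2,0,2], [0,0,0], [1,0,0], [0,0,2]]
snapshot2  = [[0,0,0], [2,0,2], [0,0,1], [1,0,0], [0,0,2]]   # T2 now wants one C
print(detect_deadlock([0,0,0], allocation, snapshot1))   # empty set: no deadlock
print(detect_deadlock([0,0,0], allocation, snapshot2))   # {1, 2, 3, 4}: deadlocked
```

Structurally this is the Safety Algorithm with the current Request matrix in place of the future-looking Need matrix, which is exactly the difference the text describes.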
8.7.3 Detection-Algorithm Usage¶
A critical practical question: How often should we run the deadlock detection algorithm?
The frequency is a trade-off based on two key factors:
- How often do deadlocks occur? (Frequency)
- How many threads are impacted when a deadlock happens? (Severity)
Strategy 1: Detect on Every Blocking Request (Extreme, Precise)¶
- Method: Invoke the detection algorithm every time a thread makes a resource request that cannot be granted immediately (i.e., the thread is about to block).
- Advantages:
- Immediate Detection: You detect a deadlock the instant it occurs.
- Identifies the "Trigger": You can pinpoint the specific thread and request that completed the wait cycle and caused the deadlock. (Though technically, all threads in the cycle jointly caused it).
- Disadvantage:
- Massive Overhead: Running an O(n²) algorithm on every blocking request can severely degrade system performance, especially in busy systems.
Strategy 2: Periodic Detection (Practical, Common)¶
- Method: Invoke the detection algorithm at regular intervals (e.g., every few minutes, every hour) or based on a system performance heuristic.
- Common Heuristic: Run detection when CPU utilization drops below a threshold (e.g., 40%). The rationale is that a deadlock will eventually cause threads to stall, reducing useful work and lowering CPU utilization.
- Advantages:
- Lower Overhead: Significantly reduces computational cost compared to per-request detection.
- Disadvantages:
- Delayed Detection: A deadlock may exist for some time before being detected, leaving resources idle.
- Cannot Identify the Cause: By the time the algorithm runs, the resource graph may contain multiple cycles. It's impossible to tell which particular request was the final trigger; you only see the resulting deadlocked set.
Choosing the Right Frequency¶
- If deadlocks are frequent: Detection must be invoked more often to minimize resource idle time and prevent the deadlock from growing (more threads might become involved).
- If deadlocks are rare: You can afford to run detection less frequently (e.g., periodically) to save on overhead.
Feature Box: Managing Deadlock in Databases¶
Database systems are a prime real-world example of the detect-and-recover strategy in action.
- Why Deadlocks Happen: Database operations are performed as transactions that acquire locks on data items (rows, tables) to ensure consistency. With multiple concurrent transactions, deadlock is a common risk.
- Detection: The database server periodically (not on every lock wait) searches for cycles in a wait-for graph representing transactions waiting for locks.
- Recovery: Upon detection, the system:
- Selects a Victim: Chooses a transaction to abort (e.g., MySQL picks the transaction with the minimal number of affected rows to reduce rollback cost).
- Performs Rollback: Aborts the victim transaction, which releases all its locks.
- Allows Others to Proceed: The remaining transactions are now free from deadlock and can resume.
- Reissues the Victim: The aborted transaction is typically restarted from the beginning by the application or database driver.
- This model works well because transactions are designed to be atomic and support rollback, making recovery feasible.
Key Insight: The choice of detection frequency is an engineering compromise. No single answer is correct for all systems. It depends on the cost of deadlock versus the cost of detection.
8.8 Recovery from Deadlock¶
Once a deadlock is detected, the system must recover from it. There are two broad approaches:
- Manual Recovery: Inform a human operator and let them handle it (e.g., through system commands).
- Automatic Recovery: The operating system itself breaks the deadlock using one of two fundamental techniques:
- Terminate one or more deadlocked threads/processes.
- Preempt resources from one or more deadlocked threads.
We will focus on automatic recovery methods.
8.8.1 Process and Thread Termination¶
This method breaks a deadlock by selectively killing deadlocked threads/processes, thereby releasing all their held resources and breaking the circular wait.
Two Termination Strategies:¶
1. Terminate All Deadlocked Processes:
- Action: Abort every process/thread that is part of the deadlock cycle.
- Advantage: Guaranteed to break the deadlock quickly and simply.
- Disadvantage: Extremely costly. These processes may have been running for a long time; their partial computations are lost entirely and must be redone from scratch. This represents a huge waste of work.
2. Terminate One Process at a Time:
- Action: Abort one deadlocked process, release its resources, then re-run the deadlock detection algorithm to see if the deadlock is resolved.
- Process: Repeat (select another victim, abort, re-detect) until the deadlock cycle is finally broken.
- Advantage: Potentially minimizes the amount of lost work, as you might only need to abort one or a few processes.
- Disadvantage: High overhead because the detection algorithm (O(n²)) must be run after each termination to re-evaluate the system state.
The Practical Challenge of Termination¶
Aborting a process is not a clean "undo." It can leave the system in an inconsistent state:
- Files: If the process was halfway through writing a file, the file may be corrupted or left with partial, meaningless data.
- Shared Data & Locks: If the process was holding a mutex lock while modifying shared data structures, aborting it will force the OS to mark the lock as available, but the shared data itself may be left in a corrupted, partially updated state. This can break other processes.
Choosing the "Best" Victim: A Cost-Based Policy¶
When using the "one at a time" method, a critical decision is: Which deadlocked process should we terminate first? This is a policy decision (like CPU scheduling) based on minimizing "cost." Factors to consider include:
- Process Priority: Terminate a low-priority process before a high-priority one.
- Computation Time Invested: How long has the process been running? Terminating a process that is 99% done is more wasteful than terminating one that just started. (But you also need to consider how much longer it would run if allowed to live).
- Resource Types Held: Are the resources held by the process easy to preempt and restore? Terminating a process holding simple resources is better than terminating one holding complex, stateful resources.
- Future Resource Needs: How many more resources does the process need to finish? A process that is almost done (needs few resources) might be a poor victim, as it's close to releasing everything.
- Cascading Termination: How many other processes will need to be terminated if this one is killed? (e.g., if this process is a parent or is communicating with others). The goal is to minimize the total number of terminated processes.
The OS must implement a heuristic or cost function that weighs these factors to select the least costly victim.
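One way such a cost function might look is sketched below; the weights, fields, and thresholds are entirely invented for illustration and are not taken from any real operating system:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    priority: int          # higher = more important, costlier to kill (assumed scale)
    cpu_time_used: float   # work that would be lost, in seconds
    resources_held: int    # simple count of resources it holds
    children: int          # processes lost to cascading termination

def termination_cost(c: Candidate) -> float:
    """Hypothetical weighted cost of aborting this candidate (lower = better victim)."""
    return (10.0 * c.priority         # protect high-priority work
            + 1.0 * c.cpu_time_used   # avoid discarding long computations
            + 0.5 * c.resources_held
            + 5.0 * c.children)       # avoid cascading termination

def pick_victim(deadlocked):
    """Choose the least costly process to abort from the deadlocked set."""
    return min(deadlocked, key=termination_cost)

batch  = Candidate(priority=1, cpu_time_used=2.0,   resources_held=3, children=0)
server = Candidate(priority=9, cpu_time_used=900.0, resources_held=8, children=4)
print(pick_victim([batch, server]) is batch)   # True: the cheap batch job is the victim
```

The same shape extends naturally to Section 8.8.2's starvation fix: add a rollback-count field and give it a steep weight so that a repeatedly preempted thread quickly stops looking cheap.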
8.8.2 Resource Preemption¶
This recovery method breaks deadlocks by forcibly taking away (preempting) resources from some deadlocked threads and giving them to others, thereby breaking the circular wait. It's less drastic than outright termination but introduces significant complexity.
Three Critical Issues to Address:¶
1. Selecting a Victim¶
We must decide which resource(s) and from which thread(s) to preempt.
- Goal: Minimize the "cost" of preemption.
- Cost Factors are similar to termination and may include:
- Number and Type of Resources Held: Preempting a single, simple resource is cheaper than preempting several complex ones.
- Amount of Computation Time Already Invested: Preempting from a thread that has run for a long time may waste more work if it has to roll back far.
- Progress to Completion: A thread close to finishing might be a poor victim.
- The system uses these factors to pick the least costly victim thread and resource combination.
2. Rollback¶
Once a resource is preempted from a thread, what happens to that thread? It cannot proceed normally because it's now missing a resource it needs.
- The Core Problem: We must roll the thread back to a previous, safe state from which it can restart, hopefully without the same deadlock recurring.
- What is a Safe State? It's a state (a snapshot of the thread's memory, registers, and resource holdings) that is consistent and correct to restart from. Determining this automatically is very difficult.
- Common (Simple) Solution: Total Rollback. Abort the victim thread completely and restart it from the beginning. This is essentially termination + restart. It's simple but wastes all the work the thread did up to that point.
- More Sophisticated (Complex) Solution: Partial Rollback. Roll the thread back only far enough to a known checkpoint before it acquired the preempted resource, then restart from there. This requires the system to maintain detailed logs (checkpoints) of each thread's state over time, which adds significant memory and runtime overhead.
3. Starvation¶
- The Problem: If victim selection is purely based on a static cost factor (e.g., "always preempt from the thread holding the fewest resources"), the same thread could be repeatedly chosen as the victim. It would never make progress, leading to starvation.
- The Solution: The victim selection policy must guarantee that no thread is picked as a victim indefinitely.
- Practical Implementation: Include the number of times a thread has already been rolled back due to preemption as a cost factor in the selection algorithm. As a thread's rollback count increases, its effective "cost" for being chosen again should increase dramatically, making it less likely to be selected repeatedly. This ensures fairness and prevents starvation.
Summary of Recovery Methods¶
- Termination is simpler but destructive (loses work, risks data corruption).
- Resource Preemption is more flexible but complex, requiring solutions for victim selection, rollback, and starvation prevention. It is most feasible in systems that already support transactions and checkpointing (like databases).
8.9 Summary¶
Core Concepts¶
- Definition: A deadlock is a state where every process/thread in a set is waiting for an event (typically resource acquisition) that can only be caused by another member of the same set. The system grinds to a halt.
- Necessary Conditions: All four must hold simultaneously for a deadlock to be possible:
- Mutual Exclusion: Resources are non-sharable.
- Hold and Wait: A thread holds resources while waiting for others.
- No Preemption: Resources cannot be forcibly taken away.
- Circular Wait: A circular chain of waiting threads exists.
Models and Tools¶
- Resource-Allocation Graph (RAG): A directed graph model used to visualize resource allocation and requests. For single-instance resource types, a cycle in the graph is both necessary and sufficient for deadlock.
- Wait-For Graph: A simplified RAG used for deadlock detection in single-instance systems. A cycle indicates deadlock.
Strategies for Handling Deadlocks¶
The chapter presented three overarching strategies, each with specific techniques:
1. Deadlock Prevention¶
- Goal: Design the system so deadlock is structurally impossible.
- Method: Negate at least one of the four necessary conditions.
- Most Practical Method: Negate Circular Wait by imposing a total ordering on all resource types and requiring that resources be requested in strictly increasing order. This is the primary technique programmers use to write deadlock-free multithreaded code.
2. Deadlock Avoidance¶
- Goal: Allow more flexibility than prevention, but dynamically deny requests that could lead to a future deadlock.
- Requirement: Advance knowledge of each thread's maximum resource needs.
- Key Algorithm: The Banker's Algorithm.
- It maintains data structures (Available, Max, Allocation, Need).
- It defines a safe state (one from which a safe sequence of thread completions exists).
- On each request, it performs a safety check and grants the request only if the resulting state remains safe. It may deny a request even if resources are free.
- Trade-off: Safer than ignoring deadlocks, but incurs significant runtime overhead.
3. Deadlock Detection and Recovery¶
- Goal: Allow deadlocks to occur, then identify and break them.
- Detection: Algorithms periodically examine system state.
- For single-instance resources: Use cycle detection in the wait-for graph.
- For multiple-instance resources: Use an algorithm similar to the Banker's safety algorithm, but driven by the current Request matrix instead of the future Need.
- Recovery: Two main methods once deadlock is detected:
- Process/Thread Termination: Abort one or more deadlocked threads. Requires a victim-selection policy based on minimizing cost.
- Resource Preemption: Forcibly take resources from some threads and give to others. Must solve victim selection, rollback (to a safe state), and starvation prevention.
Practical Reality¶
- Most general-purpose operating systems (Linux, Windows) use no automatic deadlock handling in the kernel for user processes—they effectively ignore the problem due to performance overhead.
- Therefore, the responsibility falls on application developers to use prevention techniques (especially lock ordering) to write correct, deadlock-free concurrent programs.
- Specialized systems like databases successfully employ the detect-and-recover strategy, leveraging transactional rollback for recovery.
Key Takeaway: Deadlock is a fundamental challenge in concurrency. Understanding its conditions provides the tools to prevent it in code, while the algorithms (Banker's, detection) provide theoretical and specialized practical solutions for automated management.
Chapter 9: Main Memory¶
Section 9.1: Background¶
1. The Central Role of Memory¶
In a modern computer system, memory (or main memory/RAM) is a critical component. Think of it as a vast, contiguous array of storage cells (bytes), where each byte has a unique address, like house numbers on a very long street.
The CPU's operation is an endless cycle of interacting with this memory:
- Fetch Instruction: The CPU looks at the Program Counter (PC) register, which holds the memory address of the next instruction. It goes to that address in memory and fetches the instruction.
- Decode & Execute: The CPU decodes the instruction. This instruction might itself require fetching data (operands) from specific memory addresses.
- Store Results: After performing the operation, the CPU may store the result back into a specified memory address.
This cycle highlights that nearly every CPU action involves memory access.
2. The Memory Unit's Perspective¶
From the viewpoint of the memory hardware unit, its job is simple:
- Input: A stream of memory addresses.
- Action: For each address, either read the data stored there or write new data to it.
- Crucial Point: The memory unit is "dumb." It does not know or care how an address was generated (by the program counter, by calculating an offset, etc.) or what the data at that address represents (an instruction, an integer, text, etc.). Its sole concern is the address number itself.
Therefore, when studying memory management from the OS perspective, we focus on the sequence of memory addresses produced by a running program, not on their semantic meaning.
3. Key Issues in Memory Management¶
This section will explore several foundational concepts needed to understand how an operating system manages this crucial resource:
- Basic Hardware: What hardware mechanisms (like base and limit registers) are needed to support memory management and protection.
- Address Binding: Programs are written using symbolic names (like variable names). These must eventually be translated into specific physical memory addresses. The process and timing of this binding is a key concept.
- Logical vs. Physical Addresses:
- Logical Address (Virtual Address): An address generated by the CPU during program execution. It's the program's view of memory.
- Physical Address: The actual address used by the memory hardware unit. The Memory Management Unit (MMU) hardware dynamically translates logical addresses into physical addresses.
- Dynamic Linking and Shared Libraries: How multiple programs can share common library code in memory, reducing duplication and saving memory space. This involves delaying the linking of some code until the program is actually running.
Next Steps: We will first look at the hardware support required to implement these concepts effectively. (As per your book's structure, this leads into discussions of hardware, address binding, and the logical/physical distinction).
Section 9.1.1: Basic Hardware¶
1. Direct CPU-Accessible Storage¶
The CPU can directly interact with only two types of storage:
- CPU Registers: Built into each processor core. Access is extremely fast (typically within one CPU clock cycle).
- Main Memory (RAM): Accessed via the system bus. Access is much slower, taking many CPU clock cycles.
Key Implication: All instructions and data that the CPU is actively working on must be copied into main memory (or registers). Data on disk must be moved to RAM before the CPU can use it. There are no CPU instructions that operate directly on disk addresses.
2. The Speed Problem and Caching¶
The massive speed difference between register access and RAM access creates a performance bottleneck. If the CPU had to wait (stall) for every memory access, system performance would be terrible.
The Solution: Caches. Fast, small memory located between the CPU and main RAM (often on the CPU chip itself). The cache automatically holds frequently used data from main memory, dramatically speeding up access. Cache management is handled transparently by hardware, not the OS.
(Note: As mentioned in your computer architecture course, this is the memory hierarchy. Multithreaded cores can also switch to another thread during a stall to utilize idle cycles.)
3. The Protection Problem¶
We must prevent user processes from interfering with each other or with the operating system itself. This memory protection must be enforced by hardware because software (OS) checking every memory access would be too slow.
4. Hardware Solution: Base and Limit Registers¶
A classic and fundamental hardware mechanism for memory protection uses two special CPU registers:
- Base Register: Holds the starting physical address of a process's memory region.
- Limit Register: Holds the size (length) of that memory region.
Together, these registers define a process's logical address space in physical memory. Go to Figure 9.1 in your book. This diagram shows how the OS occupies its own space in low memory, and each user process is allocated its own contiguous block defined by a unique base and limit.
How Protection Works (The MMU's Role):¶
The Memory Management Unit (MMU) is the hardware component that performs this check. Here's the sequence for every memory address generated by a user process:
- The CPU generates a logical address (e.g., from 0 to limit − 1).
- The MMU first checks whether the logical address is less than the value in the limit register.
- If the address is ≥ the limit, it is illegal. The MMU triggers a hardware trap (an interrupt) to the OS, which typically terminates the offending process.
- If the check passes, the MMU adds the logical address to the value in the base register to produce the physical address, and the access proceeds.
Go to Figure 9.2 in your book. This flowchart visualizes the hardware's address validation process, showing the comparison and the trap to the OS on violation.
Example:¶
If a process has Base = 300040 and Limit = 120900:
- Its legal physical address range is from `300040` to `(300040 + 120900 - 1) = 420939`.
- A process request for logical address `0` becomes physical address `300040`. ✅
- A process request for logical address `120899` becomes physical address `420939`. ✅
- A process request for logical address `120900` (equal to the limit) is invalid and causes a trap. ❌
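The check-and-relocate sequence above can be sketched as a toy Python model (the real check happens in MMU circuitry on every memory access; `mmu_access` is an illustrative name, not a real API):

```python
def mmu_access(logical, base, limit):
    """Toy model of the MMU's base/limit protection: validate, then relocate."""
    if logical >= limit:                      # out of range: hardware trap to the OS
        raise MemoryError(f"trap: logical address {logical} >= limit {limit}")
    return base + logical                     # legal: add the base register

# Using the numbers from the example above (Base = 300040, Limit = 120900):
mmu_access(0, base=300040, limit=120900)       # -> 300040
mmu_access(120899, base=300040, limit=120900)  # -> 420939
# mmu_access(120900, ...) raises, modeling the trap on an illegal address
```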
5. Operating System Privilege¶
The base and limit registers are privileged resources.
- They can only be loaded by instructions that execute in kernel mode (a higher privilege level).
- Only the operating system runs in kernel mode.
- This prevents a user process from changing its own memory bounds to access unauthorized areas.
The OS itself, running in kernel mode, has unrestricted access to all of memory. This is necessary for it to perform its duties: loading programs, handling system calls, performing I/O between user memory and devices, and switching between processes (context switches), which involves saving/restoring process state (including registers) to/from memory.
Section 9.1.2: Address Binding¶
1. The Journey of a Program to Memory¶
A program starts its life as an executable file on disk. To run, it must be loaded into main memory and become a process. The process then executes, fetching instructions and data from memory, and finally terminates, freeing its memory.
A key point: the process can be placed anywhere in physical memory. Its starting address in memory does not have to be zero, even though the computer's physical address numbering starts at 0. The OS decides where to place it.
2. Address Representations and Binding Steps¶
Address Binding is the process of translating the symbolic addresses in a program (like variable names) into actual, numeric physical memory addresses. This happens in stages. Go to Figure 9.3 in your book, which illustrates this multistep pipeline.
Here are the stages and the form of addresses at each stage:
- Source Program: Uses symbolic addresses (e.g., `count`, `main`, `loop`).
- Compilation: The compiler translates symbolic addresses into relocatable addresses (also called logical addresses). These are offsets relative to the start of the module (e.g., "byte 14 within this function" or "the variable located 200 bytes from the start of the data section").
- Linking/Loading: The linker (which combines multiple object files and libraries) and the loader (which places the program into memory) perform the final mapping to absolute physical addresses (e.g., physical memory byte `74014`).
Each step is a mapping from one address space to another.
3. Timing of Address Binding (A Critical Concept)¶
The final binding to a physical address can occur at different times, each with implications for flexibility and hardware requirements.
Option 1: Binding at Compile Time¶
- How it works: The compiler knows in advance the exact memory location (say, address `R`) where the process will be loaded. It generates absolute code containing direct physical addresses.
- Implication: If the starting location `R` ever needs to change (e.g., the OS evolves, or you want to run multiple copies), you must recompile the entire program.
- Use Case: Rare for general-purpose OSes. Might be used in simple, dedicated embedded systems with fixed memory layouts.
Option 2: Binding at Load Time¶
- How it works: The compiler produces relocatable code (code that uses offsets). The final physical address is calculated when the program is loaded into memory. The loader adds the actual starting physical address to every relocatable address.
- Implication: If you need to move the program, you just need to reload it; no recompilation is necessary. However, once loaded, the process cannot be moved in memory during execution.
- Use Case: Used in some older or simpler systems. Provides more flexibility than compile-time binding.
Option 3: Binding at Execution Time (Run Time)¶
- How it works: Binding is delayed until the moment the instruction is executed. The process can be moved during execution from one memory region to another. This requires specialized hardware support—specifically, a Memory Management Unit (MMU) with registers like the base register we just discussed.
- How it works with hardware: The CPU generates logical addresses (e.g., from 0 to N). The MMU's base register, which holds the current physical starting address of the process, is added to every logical address to produce the physical address on the fly. If the OS moves the process, it simply updates the base register.
- Implication: Maximum flexibility. Enables advanced techniques like virtual memory, where processes can be swapped to disk and back into different memory locations.
- Use Case: This is the method used by almost all modern general-purpose operating systems (Windows, Linux, macOS).
Summary¶
The choice of binding time represents a trade-off:
- Early Binding (Compile/Load Time): Simpler, less hardware needed, but inflexible.
- Late Binding (Execution Time): Requires complex hardware (MMU) but provides powerful flexibility for the OS to manage memory dynamically, which is essential for multitasking and efficient memory utilization.
Section 9.1.3: Logical Versus Physical Address Space¶
1. Defining the Two Key Address Types¶
It is crucial to distinguish between the two perspectives on memory addresses in a system:
- Logical Address (Virtual Address): An address generated by the CPU during the execution of a program. This is the address the program itself "sees" and works with. It is process-centric.
- Physical Address: An address that is actually placed on the memory bus and used by the memory hardware unit to read from or write to a specific cell in RAM. This is the hardware-centric view.
Go to Figure 9.4 in your book. This simple diagram shows the data flow: the CPU outputs a logical address, the MMU transforms it, and the memory receives a physical address.
2. The Relationship Depends on Binding Time¶
The relationship between logical and physical addresses is determined by when address binding occurs:
- Compile-time or Load-time Binding: Here, the logical address generated by the CPU is identical to the physical address used by memory. There is a one-to-one, fixed mapping decided early on.
- Execution-time (Run-time) Binding: This is where the key distinction arises. The logical address (now specifically called a virtual address) is different from the physical address. The mapping is dynamic and can change while the program runs.
Therefore:
- The collection of all logical addresses a program can generate is its logical address space (or virtual address space).
- The collection of all actual locations in physical memory that these logical addresses map to is its physical address space.
- With run-time binding, the logical and physical address spaces are separate but connected through dynamic mapping.
3. The Hardware Enabler: The Memory Management Unit (MMU)¶
The runtime translation from logical to physical address is performed by a dedicated hardware component called the Memory Management Unit (MMU).
- The MMU sits logically between the CPU and the main memory.
- Every memory reference from the CPU (instruction fetch, data load/store) is intercepted by the MMU and translated.
4. A Fundamental MMU Scheme: Dynamic Relocation¶
The simplest MMU scheme is a direct generalization of the base-register protection method, now focused on translation. In this context, the base register is often called a relocation register.
How Dynamic Relocation Works:
- The CPU, while executing a user process, generates a logical address (e.g., `346`). The process believes it is accessing its own "local" address `346`.
- This logical address is sent to the MMU.
- The MMU adds the value in the relocation register (R) to the logical address.
- The result is the physical address (e.g., `R + 346`) that is sent to the memory bus.
Go to Figure 9.5 in your book. This figure perfectly illustrates the process: the CPU outputs logical address 346, the relocation register holds value 14000, the MMU adds them, and physical address 14346 is sent to memory.
Example: If the relocation register (base) is set to 14000:
- Logical address `0` → Physical address `14000`
- Logical address `346` → Physical address `14346`
- Logical address `MAX` (the process limit) → Physical address `14000 + MAX`
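A minimal sketch of dynamic relocation in Python: the process keeps generating the same logical addresses, and the OS "moves" it simply by changing the relocation register (the global `relocation` stands in for the MMU register):

```python
relocation = 14000                 # current physical start of the process

def translate(logical):
    """MMU: add the relocation register to every logical address."""
    return relocation + logical

assert translate(346) == 14346     # the Figure 9.5 example

# The OS relocates the process: only the register changes.
# The program's logical addresses, pointers, and code are untouched.
relocation = 25000
assert translate(346) == 25346
```

This is why execution-time binding makes moving a running process cheap from the process's point of view: nothing inside the process changes.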
5. The Illusion Created for the User Program¶
A powerful abstraction is created:
- The user program operates solely in its logical address space (0 to max). It creates pointers, compares addresses, and performs calculations all within this logical space, completely unaware of where it physically resides in RAM.
- The operating system (with help from the MMU) manages the physical address space. It decides the value of the relocation register (R) for each process. It can change this value to move a process in memory without the process ever knowing.
- The actual, final physical location of any memory reference is not determined until the very moment the CPU generates that reference and the MMU translates it.
6. Central Concept of Modern Memory Management¶
The separation of the logical address space (belonging to the process) from the physical address space (belonging to the hardware) is the foundational idea behind all advanced memory management techniques. This separation is what allows for:
- Process Isolation: Each process has its own private, contiguous address space starting at 0.
- Dynamic Relocation: Processes can be moved in physical memory during execution.
- Memory Protection: The limit register check (from 9.1.1) ensures a process cannot generate logical addresses outside its allocated range.
- The eventual implementation of Virtual Memory: Where the logical address space can be much larger than the physical address space.
Section 9.1.4: Dynamic Loading¶
1. The Problem and The Concept¶
Until now, our discussion assumed the entire program (all code and data) must be in physical memory for execution. This limits a process's size to available physical RAM.
Dynamic Loading is a technique to improve memory utilization. The core idea: Do not load a subroutine/routine into memory until it is actually called (invoked).
2. How Dynamic Loading Works¶
- The main program is loaded into memory and begins execution. All other routines (subroutines, functions, modules) remain on disk in a relocatable load format (ready to be loaded and have addresses adjusted).
- When the main program (or a loaded routine) needs to call another routine, it first checks a routine table to see if that routine is already in memory.
- If the routine is not loaded, the program calls the relocatable linking loader.
- The loader finds the routine on disk.
- It loads the routine into an available block of memory.
- It updates the program's address tables to reflect the new location of this routine.
- Control is then transferred to the newly loaded routine.
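The steps above can be modeled in a few lines of Python. This is only a sketch of the idea: routines stay "on disk" (here, source strings) until first call, and a stub loads them into a routine table. The names `on_disk`, `routine_table`, and `call` are invented for illustration, not a real loader interface:

```python
# Hypothetical routines kept on disk in a loadable form (source strings here).
on_disk = {
    "report_error": "def report_error(msg): return 'ERROR: ' + msg",
}
routine_table = {}                 # routines currently "in memory"

def call(name, *args):
    """Call a routine, loading it on first use (dynamic loading)."""
    if name not in routine_table:          # not loaded yet?
        namespace = {}
        exec(on_disk[name], namespace)     # "relocatable linking loader"
        routine_table[name] = namespace[name]  # update the address table
    return routine_table[name](*args)      # transfer control to the routine

print(call("report_error", "disk full"))   # loaded only on this first call
```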
3. Advantages and Responsibilities¶
- Advantage: Memory is used only for code that is actually needed. This is extremely useful for large programs with significant portions dedicated to infrequent operations (e.g., error handling routines, special features). The total program size on disk can be large, but its memory footprint can be much smaller.
- Responsibility: Dynamic loading does not require special support from the operating system or hardware. It is a programming design technique: the programmer (or compiler tools) must structure the program to facilitate this loading-on-call.
- OS Role: The operating system may provide library routines to help programmers implement dynamic loading more easily, but the core mechanism is managed by the program itself.
Section 9.1.5: Dynamic Linking and Shared Libraries¶
1. Linking vs. Loading¶
It's vital to distinguish this from dynamic loading:
- Linking: The process of resolving references between different code modules (e.g., your program calling `printf()`). It connects your code to library code.
- Loading: The process of copying code/data from disk into memory.
Static Linking (the traditional method) combines all library code your program needs directly into the program's executable file at compile/link time. This creates large executables and wastes memory if multiple running programs include the same library code.
2. The Concept of Dynamic Linking¶
Dynamic Linking postpones the linking step until execution time (run time). The executable file contains only references to external library routines, not the library code itself.
These dynamically linked libraries are called Dynamically Linked Libraries (DLLs) in Windows or Shared Libraries (e.g., .so files) in Unix/Linux.
3. How Dynamic Linking Works¶
- When a dynamically linked program is loaded, the OS loader runs.
- The loader sees the program's references to DLLs. It locates the required libraries on disk.
- The loader checks if the library is already loaded in memory (perhaps by another running program). If not, it loads the library.
- The loader then patches (adjusts) the program's address tables so that calls to library functions (like `printf`) point to the correct memory address where the library is now located.
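This load-time patching can be sketched as a toy model (the names `load_library`, `load_program`, and the fake `printf` are illustrative; a real loader works with memory mappings, not dictionaries):

```python
loaded_libraries = {}   # one shared in-memory copy per library

def load_library(name):
    """OS loader: load the library from 'disk' only if not already resident."""
    if name not in loaded_libraries:
        # Each load would create a distinct copy, so this cache is
        # precisely what makes sharing between programs real.
        loaded_libraries[name] = {"printf": lambda s: "printed: " + s}
    return loaded_libraries[name]

def load_program(needed_libs):
    """Patch the program's address table to point at shared library code."""
    address_table = {}
    for lib in needed_libs:
        address_table.update(load_library(lib))
    return address_table

prog_a = load_program(["libc"])
prog_b = load_program(["libc"])
assert prog_a["printf"] is prog_b["printf"]   # one shared copy in "memory"
```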
4. Key Advantages of Dynamic Linking & Shared Libraries¶
- Reduced Executable File Size: The executable file on disk is smaller because it doesn't contain the library code.
- Efficient Memory Use (Sharing): This is the major advantage. Only one physical copy of a library routine needs to be in RAM, even if it is used by dozens of different processes. All processes map their logical address for `printf` to the same physical memory location. This saves tremendous amounts of memory. (We will see exactly how this sharing is implemented when we discuss Paging in Section 9.3.4.)
- Easier Library Updates: A library (e.g., for bug fixes or performance improvements) can be replaced on disk. All programs that use it will automatically start using the new version the next time they run, without needing to be recompiled or relinked.
- Version Management: To prevent new, incompatible library versions from breaking old programs, version information is embedded. Multiple versions of the same library can coexist in memory. A program is linked to a specific major version number and will use that version, protecting system stability.
5. Essential Operating System Role¶
Unlike dynamic loading, dynamic linking absolutely requires active help from the operating system. Why?
- Memory Protection: In a protected memory system, one process cannot normally access another process's memory. Only the OS (in kernel mode) can set up the necessary mappings to allow multiple processes to access the same physical memory pages (the shared library code).
- Coordination & Security: The OS must manage the shared library's load address, enforce access permissions (typically read-only for code), and ensure security.
The detailed mechanism of how multiple processes share the same physical pages for code will be covered in Section 9.3.4 (Paging).
Section 9.2: Contiguous Memory Allocation¶
1. The Memory Allocation Problem¶
The OS must share the finite resource of physical main memory (RAM) between:
- The Operating System kernel itself.
- Multiple User Processes.
The goal is to allocate memory to processes efficiently (minimizing waste) and fairly. This section introduces a foundational model: Contiguous Memory Allocation.
2. Memory Layout: The Two Partitions¶
Physical memory is conceptually split into two major regions (partitions):
- Operating System Partition: Holds the kernel code, data, and kernel structures.
- User Processes Partition: The remaining memory where user applications are loaded.
Placement of the OS: The OS can be placed in low memory (addresses starting at 0) or high memory (addresses at the top of the physical address space). The choice depends on factors like where the hardware's interrupt vector (a table of addresses for handling hardware events) is located. Most modern OSes (like Linux and Windows) place the OS in high memory. Our discussion will assume this high-memory OS layout.
3. What is "Contiguous Allocation"?¶
In the contiguous allocation model:
- Each process is allocated one single, continuous block of memory addresses.
- The block for one process is placed next to (contiguous with) the block for the next process in memory.
- This creates a sequence of processes in memory, like cars parked bumper-to-bumper.
Before we can manage these allocations, we must ensure memory protection between these adjacent processes.
Section 9.2.1: Memory Protection¶
1. The Combined Hardware Mechanism¶
Protection is achieved by using both hardware registers we previously discussed, now working together as a pair within the MMU:
- Relocation Register (Base Register): Holds the starting physical address of the process.
- Limit Register: Holds the size (length) of the process's logical address space.
These two registers define the process's allocated memory block completely and enforce protection.
2. How Translation and Protection Work (Step-by-Step)¶
For every memory address generated by the CPU (logical address):
- The MMU compares the logical address against the limit register.
- If `logical address >= limit`, the address is illegal. The MMU triggers a trap (fault) to the OS.
- If the address is legal (`logical address < limit`), the MMU adds the logical address to the value in the relocation register to produce the physical address.
- This physical address is sent to the memory bus for the actual read/write operation.
Go to Figure 9.6 in your book. This figure illustrates this combined check-and-translate operation within the MMU hardware.
Example: Relocation = 100040, Limit = 74600.
- Process's logical address space is from 0 to 74599.
- Process's physical address space is from 100040 to (100040 + 74599) = 174639.
- A CPU reference to logical address `0` → passes the limit check (`< 74600`) → becomes physical address `100040`.
- A CPU reference to logical address `74600` → fails the limit check (`>= 74600`) → triggers a protection fault.
3. Role During Context Switches¶
When the OS switches the CPU from one process to another (a context switch), the dispatcher (part of the OS scheduler) must:
- Save the old process's relocation and limit register values (part of its saved state).
- Load the new process's correct relocation and limit values into the physical MMU registers. This ensures each process only has access to its own specific, contiguous block of memory.
4. Dynamic Flexibility for the Operating System¶
This scheme also benefits the OS itself. The OS's own memory size can change dynamically.
- Example: A device driver for a printer doesn't need to be in memory if no printer is connected. It can be kept on disk.
- When the printer is plugged in, the OS can load the driver code into a free area of the OS partition in high memory.
- If the printer is disconnected, the OS can remove (unload) the driver, freeing that memory for other OS needs (like caches or another driver).
- The user processes are unaffected because their relocation/limit registers still define their own separate, protected region in lower memory.
In summary, the relocation+limit register pair provides the essential hardware foundation for both dynamic address translation and strict memory protection in a contiguous allocation model. The next step is to explore the algorithms the OS uses to choose where to place these contiguous blocks in the free memory space.
Section 9.2.2: Memory Allocation¶
1. The Variable-Partition Scheme¶
We now focus on how the OS decides where to place a new process in the user memory partition. The model is Contiguous Allocation with Variable-Sized Partitions.
- Each process gets one contiguous block of memory.
- Block sizes vary according to the process's actual needs.
- The OS maintains a data structure (a table or list) tracking all regions of memory: which are allocated (to processes) and which are free (available holes).
2. The Lifecycle of Memory: Holes and Fragmentation¶
Go to Figure 9.7 in your book. This figure perfectly illustrates the dynamic nature of memory under this scheme.
- Initial State: Memory is full of processes (e.g., P5, P8, P2).
- Process Termination: When a process finishes (e.g., P8), it releases its memory, creating a hole—a contiguous block of free memory.
- New Process Arrival: A new process (e.g., P9) arrives. The OS must find a hole large enough to fit P9. If found, P9 is loaded there.
- More Terminations: Another process (P5) terminates, creating a second hole. Now free memory is fragmented into two non-contiguous holes.
This cycle of allocation and deallocation leads to a memory landscape scattered with holes of various sizes.
3. Handling Insufficient Memory¶
What if a new process arrives and no single hole is large enough to hold it?
- Option 1: Reject the Process. The OS can refuse to admit the process, often with an "out of memory" error.
- Option 2: Use a Wait Queue. The process is placed in a suspended queue. Whenever a process terminates and frees memory, the OS scans this queue to see if any waiting process can now fit into the newly created (and possibly merged) hole.
4. The Core Allocation/Deallocation Procedures¶
The OS constantly performs these operations:
A. Allocating Memory for a New Process:
- Search: The OS searches its list of free holes for one that is large enough to hold the process.
- Split: If the chosen hole is larger than needed, it is split into two parts:
- One part (of the exact size required) is allocated to the new process.
- The remaining part becomes a new, smaller hole returned to the free list.
B. Deallocating Memory on Process Termination:
- Release: The process's memory block is marked as free, becoming a new hole.
- Merge (Coalesce): The OS immediately checks if this new hole is adjacent to any other existing hole(s). If it is, it merges (coalesces) them into a single, larger hole. This is critical to combat external fragmentation (discussed next).
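Steps A and B can be sketched as a toy free-list allocator in Python. Holes are `(start, size)` pairs kept sorted by start address; the function names are invented for illustration:

```python
def allocate(holes, size):
    """First-fit allocation: take the first hole big enough, splitting it."""
    for i, (start, hole_size) in enumerate(holes):
        if hole_size >= size:
            rest = holes[:i] + holes[i + 1:]
            if hole_size > size:                       # split: return leftover
                rest.insert(i, (start + size, hole_size - size))
            return start, rest
    return None, holes                                 # no hole fits

def release(holes, start, size):
    """Free a block, coalescing it with any adjacent holes."""
    holes = sorted(holes + [(start, size)])
    merged = [holes[0]]
    for s, sz in holes[1:]:
        last_s, last_sz = merged[-1]
        if last_s + last_sz == s:                      # adjacent: coalesce
            merged[-1] = (last_s, last_sz + sz)
        else:
            merged.append((s, sz))
    return merged

holes = [(0, 100)]
a, holes = allocate(holes, 30)      # a == 0,  holes == [(30, 70)]
b, holes = allocate(holes, 30)      # b == 30, holes == [(60, 40)]
holes = release(holes, a, 30)       # two separate holes remain
holes = release(holes, b, 30)       # adjacent holes coalesce again
assert holes == [(0, 100)]
```

Note how the final `release` merges three free regions back into one large hole, which is exactly the coalescing step that fights external fragmentation.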
5. The Dynamic Storage-Allocation Problem¶
This is a classic problem: Given a list of free holes and a request for n bytes, which hole should we choose?
Three common placement algorithms are used:
Algorithm 1: First-Fit¶
- Rule: Allocate the first hole in the list that is big enough.
- Implementation: Search can start at the beginning of the list each time, or from where the last first-fit search ended (a next-fit variant).
- Advantage: Fast. The search stops at the first acceptable hole.
- Disadvantage: Can lead to uneven fragmentation over time.
Algorithm 2: Best-Fit¶
- Rule: Allocate the smallest hole that is big enough to satisfy the request.
- Implementation: Must search the entire free list (unless it's pre-sorted by size).
- Advantage: Tries to minimize wasted space in the chosen hole, leaving the largest possible leftover blocks intact.
- Disadvantage: Slower (requires full scan). Ironically, it often creates many tiny, useless leftover holes ("external fragmentation").
Algorithm 3: Worst-Fit¶
- Rule: Allocate the largest available hole.
- Implementation: Must also search the entire free list (unless sorted).
- Rationale: The leftover hole after allocation will be as large as possible, hopefully still useful for future requests.
- Disadvantage: Simulations show it generally performs the worst in terms of memory utilization and speed. It tends to quickly break down large free blocks.
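The three placement policies differ only in which hole they select. A minimal sketch over a free list of `(start, size)` pairs:

```python
def first_fit(holes, n):
    """First hole big enough, scanning from the front of the list."""
    return next((h for h in holes if h[1] >= n), None)

def best_fit(holes, n):
    """Smallest hole that still fits (full scan of the list)."""
    fits = [h for h in holes if h[1] >= n]
    return min(fits, key=lambda h: h[1]) if fits else None

def worst_fit(holes, n):
    """Largest available hole (also a full scan)."""
    fits = [h for h in holes if h[1] >= n]
    return max(fits, key=lambda h: h[1]) if fits else None

holes = [(0, 100), (200, 50), (400, 300)]
assert first_fit(holes, 40) == (0, 100)    # first adequate hole
assert best_fit(holes, 40) == (200, 50)    # tightest fit, 10 bytes left over
assert worst_fit(holes, 40) == (400, 300)  # largest hole
```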
6. Performance Comparison¶
- Speed: First-Fit is generally the fastest.
- Storage Utilization (Minimizing Fragmentation):
- Both First-Fit and Best-Fit are better than Worst-Fit.
- There is no clear winner between First-Fit and Best-Fit in pure utilization; it depends on the exact workload. However, Best-Fit's tendency to create tiny fragments is a significant practical drawback.
- Practical Choice: Due to its speed and acceptable utilization, First-Fit (or Next-Fit) is often the algorithm of choice in simple systems.
The fundamental flaw of all these contiguous allocation schemes, regardless of the placement algorithm, is external fragmentation—the existence of many small, non-contiguous free holes scattered throughout memory, which together may have enough total free space for a process, but no single hole is large enough. This is the major drawback that motivates more advanced memory management schemes like Paging.
Section 9.2.3: Fragmentation¶
1. The Core Problem: External Fragmentation¶
All variable-partition contiguous allocation schemes (First-Fit, Best-Fit, Worst-Fit) suffer from External Fragmentation.
- Definition: External fragmentation occurs when the total free memory is sufficient to satisfy a request, but it is not available in a single, contiguous block. Instead, free memory is scattered as many small, separate holes between allocated processes.
- Severity: This can be crippling. In the worst case, every two processes could have a small, unusable hole between them. If all these holes were merged, they could form a block large enough to run another process.
- The 50-Percent Rule: Statistical analysis shows a grim pattern. For `N` allocated blocks using First-Fit, approximately another `0.5 N` blocks' worth of memory will be lost to fragmentation. This implies up to one-third of total memory may be unusable at any time, purely due to fragmentation overhead.
Which algorithm affects fragmentation? The choice of First-Fit vs. Best-Fit influences the pattern and size of fragments, but neither eliminates external fragmentation. Best-Fit is particularly notorious for creating many tiny, useless holes.
2. A Different Problem: Internal Fragmentation¶
Fragmentation isn't only about holes between processes. Internal Fragmentation is waste inside an allocated memory block.
- Scenario: Imagine a system with a hole of exactly 18,464 bytes. A process requests 18,462 bytes. A perfect fit leaves a 2-byte hole.
- The Overhead Issue: The metadata needed to track this new 2-byte hole (pointers, size info) will likely consume more than 2 bytes, making it nonsensical to manage.
- The Common Solution: To avoid this, systems often allocate memory in fixed-size units (blocks). When a process requests memory, the request is rounded up to the next block size.
- Definition: The difference between the memory actually allocated to the process and the memory it requested is internal fragmentation—unused memory inside its allocated partition. This space is allocated to the process but cannot be used by it.
Key Difference: External Fragmentation = Unallocated free space between processes. Internal Fragmentation = Wasted space inside an allocated block.
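The rounding that causes internal fragmentation is easy to quantify. A small sketch, assuming a 4096-byte allocation unit (the block size is an assumption for illustration):

```python
BLOCK = 4096                                  # assumed fixed allocation unit

def blocks_needed(request):
    """Round the request up to whole blocks (ceiling division)."""
    return -(-request // BLOCK)

def internal_fragmentation(request):
    """Bytes allocated to the process but unusable by it."""
    return blocks_needed(request) * BLOCK - request

assert internal_fragmentation(4096) == 0      # exact fit, no waste
assert internal_fragmentation(4097) == 4095   # one byte over: nearly a whole block wasted
```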
3. Solution 1: Compaction¶
One direct attack on external fragmentation is Compaction.
- Goal: To shuffle all occupied memory areas (processes) together, collecting all free holes into one large, contiguous free block.
- Requirement: Compaction is only possible if relocation is dynamic (at execution time). The OS must be able to move a process in physical memory and simply update its base/relocation register so the process continues running unaware. Static (compile or load-time) binding makes compaction impossible.
- The Algorithm: The simplest method is to move all processes down (or up) to one end of memory, sliding them past any holes.
- The Major Drawback: Cost. Compaction is extremely expensive in terms of CPU time. Copying large blocks of memory is slow, and the system is typically frozen during the operation. It is often a last-resort or infrequent operation.
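The "slide everything to one end" algorithm can be sketched in a few lines. Each process is modeled as a dict with a `base` (relocation value) and `size`; moving a process here is just updating its base, which mirrors why compaction requires execution-time binding:

```python
def compact(processes):
    """Slide all processes to low memory; return where the single hole starts."""
    next_free = 0
    for proc in sorted(processes, key=lambda p: p["base"]):
        proc["base"] = next_free       # move the process (the real cost is
        next_free += proc["size"]      # copying its memory, frozen system)
    return next_free                   # one large contiguous hole begins here

procs = [{"base": 100, "size": 50}, {"base": 300, "size": 20}]
hole_start = compact(procs)            # processes now at bases 0 and 50
assert hole_start == 70                # all free space gathered after byte 69
```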
4. Solution 2: Noncontiguous Allocation (Paging)¶
A more fundamental and elegant solution to external fragmentation is to abandon the requirement that a process's address space be contiguous in physical memory.
- Core Idea: Allow a process to be allocated physical memory in multiple, small, fixed-size blocks that can be placed in any available physical location, scattered throughout memory.
- The Technique: This is the essence of Paging, the dominant memory-management technique in modern operating systems.
- How it Solves Fragmentation: Since memory is divided into small, fixed-size chunks (called frames), any free frame anywhere can be used for part of a process. External fragmentation in the classic sense is eliminated (though a related, minor form may exist). The small, fixed allocation unit also makes internal fragmentation more predictable and manageable.
5. The Bigger Picture¶
Fragmentation is a universal systems problem that appears whenever a resource (memory, disk space) must be managed in variable-sized blocks. The trade-offs between external and internal fragmentation, and the costs of compaction, lead directly to the design of more sophisticated systems like paging (Chapter 9.3) and segmentation, which you will study next. The same concepts reappear in file-system storage management (Chapters 11-15).
Section 9.3: Paging¶
1. Introduction: The Noncontiguous Solution¶
Previous memory management schemes required a contiguous physical address space for each process, leading to external fragmentation and the need for costly compaction.
Paging is a fundamental scheme that solves this by allowing a process's physical address space to be noncontiguous. This avoids external fragmentation and is the dominant memory management technique in modern OSes (servers, desktops, mobile devices). It requires cooperation between the OS software and the computer hardware (MMU).
Section 9.3.1: Basic Method¶
1. The Core Idea: Pages and Frames¶
Paging involves dividing memory into fixed-sized units:
- Frame: A fixed-size block of physical memory.
- Page: A block of logical memory of the exact same size as a frame.
How it works:
- When a process is executed, its pages (from the executable file on disk) are loaded into any available memory frames, which do not need to be contiguous.
- The backing store (disk) is also organized into blocks the same size as frames (or multiples thereof).
- Key Consequence: This completely separates the logical and physical address spaces. A process can use a massive 64-bit logical address space (e.g., 2^64 bytes) even if the machine has far less physical RAM. The OS and MMU manage the mapping illusion.
2. Address Translation: Splitting the Logical Address¶
Every logical address generated by the CPU is treated as a single number. The hardware divides it into two parts:
- Page Number (p): Identifies which page within the process's logical address space this address belongs to.
- Page Offset (d): Identifies the specific byte within that page.
Logical Address: | Page Number (p) | Page Offset (d) |
Go to Figure 9.8 in your book. This diagram shows the hardware (MMU) taking the logical address, splitting it, using p to index a page table, and combining the resulting frame number with d to get the physical address.
3. The Page Table: The Mapping Directory¶
- The Page Table is the data structure that performs the mapping. Each process has its own page table.
- It is indexed by the page number (p).
- Each entry in the page table contains the base address (frame number) where that particular page is currently located in physical memory.
4. The Paging Model Visualization¶
Go to Figure 9.9 in your book. This is a crucial visualization:
- Logical Memory (Left): Shows the process's view—pages 0, 1, 2, 3 in order.
- Page Table (Center): Shows the mapping. For example, Logical Page 1 → Physical Frame 4.
- Physical Memory (Right): Shows the reality—the pages are scattered in noncontiguous frames (e.g., Page 0 in Frame 1, Page 1 in Frame 4, etc.). The offset (d) within each page/frame remains the same.
5. Step-by-Step Translation by the MMU¶
For each logical address, the MMU performs these steps automatically:
- Extract p and d: The hardware divides the logical address into the page number (p) and page offset (d).
- Consult Page Table: Use p as an index into the process's page table to look up the corresponding physical frame number (f).
- Form Physical Address: Replace the logical page number p with the physical frame number f. The offset d remains unchanged. The physical address is the concatenation f + d.
Physical Address: | Frame Number (f) | Page Offset (d) |
6. Page Size and Efficient Address Splitting¶
- Page size is a hardware design choice, always a power of 2 (e.g., 4 KB, 2 MB, 1 GB). Using a power of 2 makes the address split extremely efficient in hardware.
- How the split works:
  - Let the logical address space size be 2^m bytes (so an address is m bits long).
  - Let the page size be 2^n bytes.
  - Then, the page number (p) uses the high-order m - n bits of the logical address.
  - The page offset (d) uses the low-order n bits.
Example: If logical addresses are 32 bits (m=32) and page size is 4KB (2^12 bytes, n=12), then:
- The high 20 bits (32 - 12) are the page number p (allowing up to 2^20 pages).
- The low 12 bits are the page offset d (addressing each of the 4096 bytes in a page).
This split is performed by simple bit masking and shifting in hardware, making translation very fast.
Concrete Paging Example¶
Go to Figure 9.10 in your book. Let's walk through this small-scale example to solidify the concept.
Given:
- Logical Address Space: 16 bytes (0 to 15) → requires m=4 bits for addressing (2^4 = 16).
- Page Size: 4 bytes → requires n=2 bits for the offset (2^2 = 4). Therefore, the page number (p) uses the high m-n = 2 bits.
- Physical Memory: 32 bytes, divided into 8 frames (0 to 7) of 4 bytes each.
- Page Table: As shown in the figure, maps logical pages to physical frames.
Address Translation Calculations:
Logical Address 0:
- Binary: 0000. High 2 bits 00 = Page 0. Low 2 bits 00 = Offset 0.
- Page Table: Page 0 → Frame 5.
- Physical Address = (Frame 5 * Page Size 4) + Offset 0 = (5 * 4) + 0 = 20.
Logical Address 3:
- Binary: 0011. High 2 bits 00 = Page 0. Low 2 bits 11 = Offset 3.
- Page Table: Page 0 → Frame 5.
- Physical Address = (5 * 4) + 3 = 23.
Logical Address 4:
- Binary: 0100. High 2 bits 01 = Page 1. Low 2 bits 00 = Offset 0.
- Page Table: Page 1 → Frame 6.
- Physical Address = (6 * 4) + 0 = 24.
Logical Address 13:
- Binary: 1101. High 2 bits 11 = Page 3. Low 2 bits 01 = Offset 1.
- Page Table: Page 3 → Frame 2.
- Physical Address = (2 * 4) + 1 = 9.
This example shows how the contiguous logical address space (0-15) is mapped to noncontiguous physical frames (5, 6, 2).
Paging as Dynamic Relocation¶
Paging is a sophisticated form of execution-time (dynamic) relocation.
- Instead of a single base register for the whole process, paging uses a page table—which acts like an array of base registers, one for each logical page of the process.
- Each page table entry provides the "base address" (frame number) for its corresponding page.
Fragmentation in Paging¶
This is a key result:
External Fragmentation: ELIMINATED.
- Since any free frame anywhere can be used for any page, there is no problem of needing a large contiguous free block. Free memory is a list of individual frames, not a fragmented hole structure.
Internal Fragmentation: STILL EXISTS (but is manageable).
- Memory is allocated in fixed-size frame units. A process's last page is unlikely to be exactly full.
- Example: Page size = 2048 bytes. Process size = 72,766 bytes.
- Number of pages needed = ceil(72,766 / 2048) = ceil(35.53) = 36 pages.
- Memory allocated = 36 * 2048 = 73,728 bytes.
- Internal Fragmentation = 73,728 - 72,766 = 962 wasted bytes (in the last frame).
- Worst Case: A process needing n pages + 1 byte gets n+1 frames, wasting almost an entire frame.
On average, internal fragmentation is about half a page per process.
The Page Size Trade-Off¶
Choosing the page size is a critical design decision involving a trade-off:
Smaller Page Size:
- Advantage: Reduces internal fragmentation (wasted half-pages are smaller).
- Disadvantage: Increases the size of the page table. More pages per process means more page table entries, increasing memory overhead for the tables themselves. Also can lead to more frequent page faults (discussed later).
Larger Page Size:
- Advantage: Reduces page table size. Fewer entries are needed.
- Advantage: Improves disk I/O efficiency when reading/writing pages from/to disk (fewer, larger transfers are often faster).
- Disadvantage: Increases internal fragmentation.
Historical Trend & Modern Practice: Page sizes have grown over time as memory sizes and datasets have increased. Today, the standard page size is 4 KB or 8 KB. However, to get the benefits of large pages for certain applications (like large databases), most modern systems support multiple page sizes.
- x86-64 / Windows 10: Supports 4 KB and 2 MB "large pages".
- Linux: Supports the standard 4 KB page and huge pages (often 2 MB or 1 GB). Larger pages reduce page table overhead and TLB misses (you'll learn about the TLB next) for applications that use large, contiguous memory regions.
Page Table Entry Size and Physical Memory Addressing¶
A Page Table Entry (PTE) must store the physical frame number and additional control bits.
- Typical Size: On a 32-bit CPU, a PTE is often 4 bytes (32 bits) long.
- Addressing Power: With 32 bits, an entry can point to any of 2^32 different frames.
- Maximum Physical Memory: If frame size = 4 KB (2^12 bytes), then:
- Physical address = (Frame Number * Frame Size) + Offset.
- The frame number (32 bits) + offset (12 bits) = 44-bit physical address.
- This can address 2^44 bytes = 16 Terabytes (TB) of physical memory.
- Important Nuance: The maximum logical address space for a single process (e.g., 32 bits = 4 GB) is different from the total physical memory the system can support (e.g., 16 TB). Paging decouples these two concepts.
- Reduced Bits for Framing: In reality, PTEs contain extra control bits (like valid/invalid, protection, dirty, reference) which use some of the 32 bits. This slightly reduces the number of bits available for the frame number, capping the addressable physical memory below the theoretical 16 TB maximum for a 4-byte PTE.
Process Loading and Frame Allocation¶
When a process is scheduled to run:
- The OS checks the process size in pages.
- It checks if at least that many free frames exist in physical memory.
- If frames are available, the OS allocates them to the process. These frames can be anywhere in memory.
- The OS loads each page from disk into its allocated frame.
- For each page, the OS creates an entry in the process's page table, storing the allocated frame number.
Go to Figure 9.11 in your book. This figure shows the process:
- (a) Before Allocation: A free-frame list (e.g., frames 13, 14, 18, 20, 15).
- (b) After Allocation: The new process's pages (0,1,2,3) are loaded into frames 14, 13, 18, 20 respectively. The free-frame list now only has frame 15. The process's page table is built with the mappings.
The Paging Abstraction and Protection¶
Paging creates a powerful abstraction:
- Programmer/Process View: A single, contiguous, private address space starting at address 0.
- Physical Reality: The process's code and data are scattered across various physical frames, intermingled with frames belonging to other processes.
- The Enforcer: The address-translation hardware (MMU) reconciles these views transparently. It is impossible for a process to generate a physical address outside its own page table mappings, providing strong memory protection automatically. A process cannot even name (address) memory belonging to the OS or other processes.
Operating System Responsibilities¶
The OS has crucial management duties:
A. The Frame Table:
- The OS maintains a single, system-wide data structure called the frame table.
- It has one entry per physical frame in the entire system.
- Each entry tracks: whether the frame is free or allocated, and if allocated, which process and which page of that process occupies it. (This is essential for page replacement algorithms in virtual memory).
B. Managing Addresses in System Calls:
- When a user process passes a memory address (e.g., a buffer pointer) as a parameter to a system call (like read() or write()), that address is a logical address.
- The OS (running in kernel mode) must translate that logical address into the corresponding physical address to access the correct user data. It does this by consulting the process's page table.
C. Maintaining Process Page Tables:
- The OS keeps a master copy of each process's page table in kernel memory.
- This copy is used for:
- Manual address translation during system calls.
- Setting up the hardware page-table pointer (e.g., the CR3 register on x86) during a context switch—the dispatcher loads this register to point to the new process's page table before giving it the CPU.
- Consequence: Because the page table must be loaded/reloaded on each context switch, paging increases context-switch time compared to simpler schemes, as there is more hardware state to manage.
Linux Tip: Obtaining Page Size On a Linux system, you can find the system's page size via:
- System Call: Using getpagesize() in a C program.
- Command Line: Using the command getconf PAGESIZE.
Both return the page size in bytes (e.g., 4096).
Section 9.3.2: Hardware Support¶
1. Page Tables as Process State¶
The page table is a critical piece of per-process hardware state.
- A pointer to the process's page table (often called the Page Table Base Register or similar) is stored in the Process Control Block (PCB) along with the instruction pointer, general registers, etc.
- During a context switch, the OS scheduler's dispatcher must:
- Save the current process's page-table pointer.
- Load the page-table pointer of the next process to run into the appropriate hardware register.
- This tells the MMU (Memory Management Unit) where in physical memory to find the new process's mapping directory.
2. Hardware Implementation Options for Page Tables¶
There are two primary ways to implement the page table in hardware, representing a classic speed vs. flexibility trade-off.
Option 1: Page Table in Dedicated Hardware Registers¶
- How it works: The entire page table is stored in a set of very high-speed registers inside the CPU/MMU.
- Advantage: Extremely fast translation. Every address translation requires just a register lookup, which is nearly instantaneous.
- Major Disadvantage: High context-switch overhead. Switching to a new process requires copying all page table entries from memory into these registers, which is slow if the page table is large. This makes context switches very expensive.
- Feasibility: This is only practical for very small page tables (e.g., 256 entries or less). It was used in older or simpler architectures.
Option 2: Page Table in Main Memory (The Standard Approach)¶
- How it works: The page table is stored in ordinary main memory (RAM).
- Hardware Pointer: A single, dedicated CPU register called the Page-Table Base Register (PTBR) holds the starting physical address of the page table in memory.
- Advantage: Low context-switch overhead. To switch processes, the OS only needs to update the single PTBR to point to the new process's page table in memory. This is very fast.
- Disadvantage: Slower address translation. Every memory access now logically requires two physical memory accesses:
- One to read the page table entry from memory (at the address PTBR + p * entry_size).
- A second to perform the actual desired read/write at the translated physical address. This would halve performance, which is intolerable.
- The Crucial Solution: The performance penalty of the two-memory-access problem is solved by a special high-speed cache called the Translation Lookaside Buffer (TLB), which you will learn about next. The PTBR-in-memory scheme, combined with a TLB, is the standard in all modern systems.
Section 9.3.2.1: Translation Look-Aside Buffer (TLB)¶
1. The Performance Problem with Memory-Resident Page Tables¶
Storing the page table in main memory keeps context switches fast but creates a critical performance issue. Each memory access now logically requires:
- First Memory Access: Read the Page Table Entry (PTE) from main memory (using the PTBR and page number p).
- Second Memory Access: Perform the actual data read/write at the translated physical address.
This doubles the memory access time, which is unacceptable. A solution is needed to make translation fast while keeping page tables large and in memory.
2. The Solution: TLB as a Translation Cache¶
The Translation Look-aside Buffer (TLB) is the solution. It is a special, small, extremely fast hardware cache inside the MMU that stores recent page-to-frame translations.
- Nature: It is an associative memory (content-addressable memory).
- Entry Structure: Each TLB entry has two parts:
- Key/Tag: The Virtual Page Number.
- Value: The corresponding Physical Frame Number, plus protection/status bits.
- Operation: When presented with a virtual page number, the TLB compares it against all keys simultaneously. If a match is found (a TLB hit), the corresponding frame number is returned immediately.
- Size & Speed: To be fast enough, the TLB must be small (typically 32 to 1024 entries). Lookup is designed to be part of the CPU's instruction pipeline, adding almost zero delay.
3. TLB Hit vs. TLB Miss¶
Go to Figure 9.12 in your book. This diagram shows the complete flow, which we will describe step-by-step.
Case A: TLB Hit
- CPU generates a logical address (page p, offset d).
- MMU presents page p to the TLB.
- TLB finds a matching entry (a hit).
- MMU instantly retrieves the frame number f.
- MMU forms physical address (f + d) and accesses memory. This is as fast as a non-paged memory access.
Case B: TLB Miss
- CPU generates logical address.
- MMU presents page p to the TLB.
- TLB does not have an entry for p (a miss).
- The MMU must now perform the standard page table walk: it uses the PTBR to find the page table in main memory and looks up the PTE for page p. This requires one (or more) slow memory accesses.
- Once the frame number f is retrieved from memory, the MMU:
  - Uses it to form the physical address and access the data.
  - Crucially, it adds the new mapping (p → f) to the TLB for future use.
- If the TLB is full, an existing entry must be evicted using a replacement policy (e.g., LRU, round-robin, random). Some OS-critical entries can be wired down (locked) to prevent eviction.
4. Managing Multiple Processes: The ASID¶
The TLB must correctly handle multiple processes, each with its own page table mapping the same virtual page numbers to different physical frames.
Solution: Address-Space Identifiers (ASIDs)
- An ASID is a unique number assigned to each process by the OS.
- Each TLB entry is tagged with the ASID of the process that owns the translation.
- During translation, the MMU checks that the ASID in the TLB entry matches the ASID of the currently running process.
- If ASIDs match: Translation is valid. ✅
- If ASIDs don't match: It's treated as a TLB miss, even if the page number matches. This prevents a process from using another process's translations.
Without ASIDs: The TLB must be completely flushed (erased) on every context switch to prevent the new process from using stale translations, causing many TLB misses after a switch. ASIDs allow the TLB to hold entries for multiple processes simultaneously, greatly improving performance.
5. Performance Analysis: Hit Ratio and Effective Access Time¶
- Hit Ratio: The percentage of memory accesses that result in a TLB hit. A high hit ratio (e.g., >99%) is crucial for performance.
- Effective Memory-Access Time (EAT) Calculation: We calculate the weighted average of hit and miss times.
Example (Simplified):
Memory access time = 10 ns.
TLB lookup time = 0 ns (assumed part of pipeline).
Hit Ratio = 80%
- TLB Hit (80%): Access memory once = 10 ns.
- TLB Miss (20%): Access page table in memory (10 ns) + then access data (10 ns) = 20 ns.
- EAT = (0.80 × 10) + (0.20 × 20) = 12 nanoseconds. (A 20% slowdown.)
Hit Ratio = 99% (More realistic for a good TLB)
- EAT = (0.99 × 10) + (0.01 × 20) = 10.1 nanoseconds. (Only a 1% slowdown.)
This shows why a high TLB hit ratio is essential—it makes paging overhead negligible.
6. Modern Complexity: Multi-Level TLBs¶
Modern CPUs (like Intel Core i7) have hierarchical TLBs, similar to multi-level CPU caches:
- L1 TLB: Very small and fast (e.g., 64-128 entries), split into separate Instruction TLB (ITLB) and Data TLB (DTLB).
- L2 TLB: Larger but slower (e.g., 512-2048 entries), unified.
- On a miss in all TLBs, the CPU must perform a costly page-table walk in memory (hundreds of cycles) or trigger an OS handler.
7. OS and Hardware Co-Design¶
TLBs are hardware, but the OS must be designed to use them optimally.
- The OS designer must understand the TLB characteristics of the target CPU (size, replacement policy, ASID support).
- The OS's paging implementation (e.g., how it structures page tables) must align with the hardware's TLB and page-walking mechanisms.
- Changes in CPU TLB design across generations may require changes in the OS to maintain performance.
In summary, the TLB is the essential hardware component that makes paging with memory-resident page tables viable by caching translations and achieving a very high hit rate, keeping the performance penalty of address translation extremely low.
Section 9.3.3: Protection¶
1. Page-Level Protection Bits¶
In a paged system, memory protection is integrated into the page table entries (PTEs). Each PTE contains protection bits associated with its corresponding frame.
- Basic Protection: A common bit defines if the page is read-only or read-write.
- Mechanism: On every memory access, the MMU performs the address translation. As it retrieves the frame number from the PTE, it simultaneously checks the protection bits.
- Violation: If the access violates the protection (e.g., a write to a read-only page), the MMU triggers a hardware trap (fault) to the operating system, typically resulting in a segmentation fault for the offending process.
- Extended Protection: Hardware can support finer-grained bits, such as execute-only (common for security, to prevent data from being executed as code), or separate bits for read, write, and execute to allow any combination.
2. The Valid-Invalid Bit¶
One of the most important protection bits is the valid-invalid bit.
- Valid (v): Indicates that the page is within the process's logical address space and is a legal page that the process is allowed to access.
- Invalid (i): Indicates that the page is not within the process's logical address space. An access to such a page is illegal and causes a trap to the OS.
The OS sets this bit for each page table entry when setting up or modifying a process's address space.
3. Concrete Example: Valid vs. Invalid Pages¶
Go to Figure 9.13 in your book. Let's analyze this example carefully.
- System Specs: 14-bit logical address space → addresses 0 to 16383. Page size = 2 KB (2048 bytes).
- Program Size: Uses addresses 0 to 10468 only.
- Page Calculation:
- Page size = 2048 bytes.
- Program uses ceil(10469 / 2048) = ceil(5.11) = 6 pages (Pages 0-5).
- However, the full 14-bit address space contains 16384 / 2048 = 8 pages (Pages 0-7).
- Page Table Setup (See Figure):
  - Entries for Pages 0-5 are marked valid (v). They map to some physical frames.
  - Entries for Pages 6 and 7 are marked invalid (i).
- Address Validation:
- Access to address 5000 (in Page 2): Valid → Allowed.
- Access to address 12288 (first byte of Page 6): Invalid → Trap to OS.
- The "Problem" / Nuance: Notice that the last valid page (Page 5) contains addresses 10240 to 12287. Since the program only goes up to 10468, addresses 10469 through 12287 in Page 5 are also "illegal" for the program, but are still marked as valid. This is because protection is enforced at the page granularity, not byte granularity. The invalid region starts only at the next page boundary (Page 6). This is a direct consequence of internal fragmentation—the entire last page (Page 5) is allocated, even though part of it is unused.
4. The Page-Table Length Register (PTLR)¶
Storing entries for all possible pages in a large address space (e.g., 2^20 entries for a 32-bit space with 4KB pages) is wasteful if a process uses only a small fraction.
- Hardware Support: Some architectures provide a Page-Table Length Register (PTLR).
- Function: The PTLR holds the size (number of entries) of the current process's page table.
- Protection Check: For every logical address, the hardware compares the extracted page number (p) against the PTLR value.
  - If p ≥ PTLR, the page number is out of bounds, causing an immediate trap. This check happens before the page table lookup.
- Benefit: This allows the OS to allocate a page table only as large as needed for the process's highest-used page, saving memory. Entries for higher, unused page numbers simply don't exist.
Summary: Paging provides natural and efficient memory protection. The protection bits control the type of access (read/write/execute), the valid-invalid bit defines the legal address space, and hardware like the PTLR can optimize table size. Protection is enforced automatically by the MMU on every access, trapping any violation to the OS.
Section 9.3.4: Shared Pages¶
1. The Need for Sharing Code¶
A major advantage of paging is the ability for multiple processes to share identical code in physical memory. This is crucial for system libraries used by almost every program (like the standard C library, libc).
- Without Sharing: If 40 processes each load their own 2 MB copy of libc, total memory use = 80 MB.
- With Sharing: If all 40 processes share one physical copy of libc, total memory use for the library = only 2 MB. This is a dramatic saving.
2. Requirement: Reentrant (Pure) Code¶
Sharing is only possible for reentrant code (also called pure code or non-self-modifying code).
- Definition: Code that never changes during execution. It contains only instructions (no self-modifying instructions) and constants. Any process-specific data must be stored in a separate, private data section.
- Why? If shared code were modifiable, one process could change an instruction, corrupting execution for all other sharing processes.
- Enforcement: The OS must enforce this by marking shared code pages as read-only in the page table protection bits. A write attempt by any process will trigger a protection fault.
3. How Page Sharing Works¶
Go to Figure 9.14 in your book. This illustrates the mechanism.
- Physical Memory: A single copy of the libc library occupies a set of physical frames (e.g., frames 4, 1, 6, 3).
- Process Page Tables: Each process (P1, P2, P3) has its own, independent page table.
- Mapping: In each process's page table, the entries for the libc pages are mapped to the same physical frame numbers (4, 1, 6, 3).
- Private Data: Each process's private data pages (like P1's page 0, P2's page 0) are mapped to different physical frames, ensuring isolation.
- Result: All processes execute the same physical machine code from frames 4, 1, 6, and 3, but maintain separate private data areas.
4. What Can Be Shared?¶
- System Libraries: The primary use case (e.g., libc, libm).
- Common Programs: Compilers, database systems, window system code.
- Shared Memory (IPC): Some operating systems implement the shared memory interprocess communication mechanism (from Chapter 3) using shared pages. Processes can map a region of their address space to the same physical pages to exchange data at memory speed.
- Kernel Code: The kernel itself is often shared among all processes via paging, mapped into a protected region of every address space.
5. Relation to Other Concepts¶
- Dynamic Linking & Shared Libraries (Section 9.1.5): This is the implementation mechanism. When a program is dynamically linked to libc, the OS loader sets up the process's page table to map to the already-loaded, shared physical pages of libc, rather than loading a separate copy.
- Threads (Chapter 4): Threads within the same process naturally share all code and data pages of their process, which is an even tighter form of memory sharing.
- Virtual Memory Benefits (Chapter 10): This sharing capability is one of several powerful benefits enabled by the paging abstraction, which you will explore further.
In summary, shared pages via paging enable massive memory savings and efficient program construction by allowing multiple processes to map their logical pages to the same physical frames containing reentrant code, enforced by hardware protection bits.
Section 9.4: Structure of the Page Table¶
1. The Problem: Large Page Tables¶
Modern systems have huge logical address spaces (32-bit to 64-bit). This leads to massive page tables if implemented as a simple linear array.
Example (32-bit system, 4 KB pages):
- Logical Address Space: 2^32 bytes.
- Page Size: 2^12 bytes (4 KB).
- Number of Pages: 2^32 / 2^12 = 2^20 = ~1 million pages.
- Page Table Size: 1 million entries * 4 bytes per entry = 4 MB per process. Allocating 4 MB of contiguous physical memory for each process's page table is wasteful and impractical. We need smarter page table structures.
Section 9.4.1: Hierarchical Paging (Multi-Level Page Tables)¶
1. The Core Idea: Page the Page Table¶
The solution is to apply paging to the page table itself. Instead of one huge, contiguous page table, we break it into smaller pieces (pages) and only keep the needed pieces in memory. This creates a hierarchical (tree-like) structure.
2. Two-Level Paging (Forward-Mapped Page Table)¶
This is the most common hierarchical scheme for 32-bit architectures.
Go to Figure 9.15 in your book for a visual of the structure, and Figure 9.16 for the translation process.
Example (32-bit address, 4 KB pages):
- Logical address split originally: 20-bit page number (p), 12-bit offset (d).
- New Split with Two-Level Paging:
  - Divide the 20-bit page number further:
    - p1 (10 bits): Index into the outer page table (also called the page directory).
    - p2 (10 bits): Index into an inner page table (a page of page-table entries).
  - d (12 bits): Page offset (unchanged).
Logical Address: | p1 (10 bits) | p2 (10 bits) | d (12 bits) |
How Translation Works (Follow Figure 9.16):
- The Page-Directory Base Register (PDBR) holds the physical address of the outer page table.
- Use p1 as an index into the outer page table. This entry gives the physical address of the inner page table page.
- Use p2 as an index into that inner page table page. This entry gives the physical frame number of the desired data page.
- Combine this frame number with offset d to get the final physical address.
Advantages:
- No large contiguous allocation needed. The outer table is small (e.g., 2^10 entries = 4 KB), and inner table pages are allocated only as needed.
- Saves memory. If a process uses only a sparse region of its address space, only the corresponding inner page tables need to be allocated. Unused regions just have invalid entries in the outer table.
3. The 64-Bit Address Space Problem¶
For 64-bit architectures, two-level paging becomes insufficient.
Example (64-bit, 4 KB pages):
- Logical Address Space: 2^64 bytes (massive).
- Page Size: 2^12 bytes.
- Number of Pages: 2^(64-12) = 2^52 pages (over 4 quadrillion).
- Even with two-level paging:
  - Let the inner table fit in one page (1024 entries = 10 bits for p2).
  - Then the outer table needs 2^(52-10) = 2^42 entries.
  - That's 16 Terabytes just for the outer page table! This is obviously impossible.
4. Deeper Hierarchies: Three-Level and Beyond¶
The solution is to add more levels, paging the outer tables themselves.
- Three-Level Paging: Outer page table → Middle page table → Inner page table → Data Page.
- Four-Level Paging: Adds another level.
- Extreme Case (64-bit UltraSPARC): Required seven levels of paging.
The Major Drawback: Each additional level adds one extra memory access for translation (unless cached in the TLB). With seven levels, a single memory access could theoretically require eight memory reads (seven for page tables, one for data), which is prohibitively slow.
5. Conclusion for Hierarchical Paging¶
- Appropriate for: 32-bit systems (typically 2 levels) and some 64-bit systems with a limited virtual address width (e.g., 48-bit, using 3-4 levels).
- Inappropriate for: Full 64-bit address spaces with many levels, due to high translation overhead and table size.
This limitation for 64-bit systems motivates alternative page table structures like Hashed Page Tables and Inverted Page Tables, which you will study next.
Section 9.4.2: Hashed Page Tables¶
1. Motivation: Handling Large Address Spaces¶
Hierarchical page tables become inefficient for very large (e.g., full 64-bit) address spaces due to excessive depth or table size. The Hashed Page Table is an alternative structure designed for these large spaces. It uses a hash table data structure to store page table entries.
2. Structure of a Hashed Page Table¶
The core idea is to use the virtual page number as a key to hash into a table.
- Hash Table: An array of buckets, each being the head of a linked list (to handle hash collisions).
- Linked List Entry (Element): Each node in the list contains three fields:
- Virtual Page Number: The key used for exact matching.
- Mapped Physical Frame Number: The translation value.
- Pointer to Next Element: For collision chaining.
Go to Figure 9.17 in your book. This figure shows the translation flow: the virtual page number (p) is hashed, leading to a bucket and a linked list traversal to find the matching entry, yielding the frame number (r).
3. Translation Algorithm¶
For a given logical address (virtual page number p, offset d):
- Hash: Apply a hash function to the virtual page number p. This produces an index into the hash table.
- Traverse the Linked List: Go to the corresponding bucket (linked list). Traverse the list, comparing the virtual page number p with the Virtual Page Number field (1) in each entry.
- Match Found (Hit): When a matching entry is found, extract the Physical Frame Number (field 2). Combine it with offset d to form the physical address.
- No Match (Miss): If the end of the list is reached without a match, it's a page fault (the page is not in memory), triggering the OS page-fault handler.
4. Advantages and Considerations¶
- Efficient for Sparse Address Spaces: It only allocates memory for entries that actually exist (i.e., pages that are valid and in memory), unlike hierarchical tables which allocate whole pages of entries.
- Potentially Constant Lookup Time: With a good hash function and low collisions, average lookup time can be near O(1), though worst-case (many collisions) can be O(n) for list length.
- Hardware/Software Cooperation: Often, the hash table is managed by the operating system in software. The hardware (MMU) may have a dedicated register pointing to the start of the hash table. On a TLB miss, a hardware page-table walker unit may perform this hash lookup in hardware, or it may trap to an OS software handler.
5. Enhancement: Clustered Page Tables¶
A variation designed specifically for 64-bit address spaces.
- Key Difference: Each entry in the hash table (each node in the linked list) maps not just one, but a cluster of consecutive virtual pages (e.g., 16 pages) to a corresponding set of physical frames.
- Why it Helps:
- Reduces Table Size: A single entry can store 16 mappings, shrinking the total number of required entries.
- Improves Locality: If a process accesses a virtual page, it's likely to access neighboring pages soon. A clustered entry brings those translations into the TLB or cache together.
- Excellent for Sparse Addresses: Even if references are scattered, grouping them into clusters minimizes the number of separate hash table entries needed.
- Use Case: Ideal for very large, sparse address spaces where memory references are noncontiguous and spread out over vast virtual memory ranges.
Section 9.4.3: Inverted Page Tables¶
1. The Problem with Standard Page Tables¶
Standard (forward) page tables have one entry per virtual page per process. In a system with many processes and huge address spaces, the combined size of all page tables can consume a significant portion of physical memory itself. This is wasteful.
2. The Core Idea of Inverted Page Tables¶
The inverted page table takes a radically different approach.
- It has only one system-wide page table.
- It has exactly one entry for each physical frame (page frame) in the system's RAM.
- Each entry tells us: Which process's virtual page is currently occupying this physical frame.
Contrast: A standard page table maps Virtual Page → Physical Frame. An inverted page table maps Physical Frame → (Process, Virtual Page).
Go to Figure 9.18 in your book and compare it with Figure 9.8. This highlights the inversion: instead of many tables indexed by virtual page number, there is one table indexed by physical frame number.
3. Structure and Translation Process¶
Entry Format: Each entry contains:
- Process Identifier (PID) / Address-Space Identifier (ASID): Identifies the process that owns the page.
- Virtual Page Number: The virtual page from that process's address space that is stored in this frame.
Logical Address Format: Becomes a triple: <Process-ID, Virtual-Page-Number, Offset>.
Translation Algorithm (Given a logical address <pid, p, d>):
- Search: The system must search the entire inverted page table for an entry where (PID == pid) AND (Virtual-Page-Number == p).
- Match Found: If found at entry i, then i is the physical frame number. The physical address is <i, d>.
- No Match: If not found, it's a page fault (the page is not in memory).
The Critical Performance Problem: The inverted page table is sorted/organized by physical frame number, but we need to look up by virtual address. This necessitates a linear search through the entire table on every translation—impossibly slow for systems with millions of frames.
4. The Solution: Hashing (Combining with Section 9.4.2)¶
To make inverted page tables practical, they are always used with a hash table.
- The hash function takes the <pid, p> pair as input.
- The hash table points to entries (or chains of entries) in the inverted page table where that virtual address might be located.
- This reduces the search from scanning the whole table to checking just one or a few entries in a linked list.
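A minimal Python sketch of this combination follows. The frame count, the use of Python's built-in `hash` on the `(pid, vpn)` pair, and the dictionary-based chain index are all assumptions made for illustration.

```python
# Sketch of an inverted page table plus hash index (illustrative names).
PAGE_SIZE = 4096
NUM_FRAMES = 8

# One entry per physical frame: (pid, vpn), or None if the frame is free.
inverted_table = [None] * NUM_FRAMES
# Hash index: bucket number -> chain of candidate frame numbers.
hash_index = {}

def map_page(pid, vpn, frame):
    inverted_table[frame] = (pid, vpn)
    hash_index.setdefault(hash((pid, vpn)) % NUM_FRAMES, []).append(frame)

def translate(pid, logical_address):
    vpn, offset = divmod(logical_address, PAGE_SIZE)
    # Without the hash index we would scan all NUM_FRAMES entries;
    # with it we check only the frames on one short chain.
    for frame in hash_index.get(hash((pid, vpn)) % NUM_FRAMES, []):
        if inverted_table[frame] == (pid, vpn):
            return frame * PAGE_SIZE + offset
    raise LookupError("page fault")

map_page(pid=7, vpn=3, frame=5)
print(translate(7, 3 * PAGE_SIZE + 20))  # 5*4096 + 20 = 20500
```

The frame number is recovered from the entry's position in the table, which is exactly the "inversion": position encodes the physical frame, contents encode the owner.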
Performance Note: Even with hashing, a TLB miss now requires at least two memory accesses: one to the hash table and one to the inverted page table entry (plus the final data access). This is slower than a hierarchical page table walk, making the TLB hit ratio even more critical.
5. The Shared Memory Problem¶
This is a significant drawback of inverted page tables.
- In Standard Paging: Multiple processes can map different virtual pages to the same physical frame (shared memory, shared libraries). Each process has its own page table entry pointing to the shared frame.
- In Inverted Page Tables: There is only one entry per physical frame. That entry can store only one <pid, virtual-page> pair. Therefore, only one process at a time can have that frame mapped into its address space.
- Consequence: True sharing, where multiple processes simultaneously map the same frame, is not directly supported. If a second process tries to access shared memory, it will cause a page fault. The OS could handle this by mapping a different physical frame and copying data (breaking true sharing) or by complex workarounds, making shared memory inefficient.
6. Usage¶
Despite the shared memory limitation, inverted page tables have been used in some major 64-bit architectures (like older PowerPC/Power and UltraSPARC), where the memory savings from having a single, small table outweighed the drawbacks for their target workloads. Modern x86-64 systems, however, use multi-level hierarchical page tables, which better support features like shared memory.
Section 9.4.4: Oracle SPARC Solaris (Case Study)¶
1. Introduction: A Modern 64-Bit Integrated Design¶
This section examines a real-world, high-performance system: Solaris OS on SPARC CPUs. It's a tightly integrated 64-bit design that must manage huge virtual address spaces without wasting physical memory on page tables. Its solution is an efficient, hybrid approach centered around hashed page tables.
2. Core Structure: Hashed Page Tables with "Mapped Regions"¶
Solaris/SPARC uses hashed page tables, but with a key optimization:
- Two Separate Hash Tables:
- Kernel Hash Table: For the kernel's address space.
- User Process Hash Table: Shared by all user processes (this is essentially an inverted page table structure, as it maps physical frames to owning virtual addresses).
- Entry Granularity (The Optimization): Each hash table entry does not represent a single 4KB page. Instead, it represents a contiguous area of mapped virtual memory (a region of many pages).
- Each entry contains:
- Base Virtual Address
- Span: The number of pages in this contiguous region.
- Benefit: This is far more efficient for large, contiguous mappings (common for code segments, large data arrays) as a single hash entry can cover megabytes of memory, drastically reducing the total number of required entries.
3. The Multi-Level Translation Hierarchy¶
The system uses a multi-tiered caching structure for translation, designed to make the common case (TLB hit) extremely fast, while handling misses efficiently.
Follow this translation flow:
Level 1: Translation Lookaside Buffer (TLB)
- Purpose: The fastest hardware cache on the CPU chip, holding the most recently used Translation Table Entries (TTEs).
- Action on memory access: The MMU checks the TLB first.
- TLB Hit: Translation is immediate → Physical address formed → Access proceeds. (This is the common, fast path.)
- TLB Miss: Proceed to Level 2.
Level 2: Translation Storage Buffer (TSB)
- What it is: A larger, software-managed cache of TTEs kept in main memory. It's essentially a direct-mapped cache for page table entries.
- Action on TLB Miss: The CPU's hardware page-table walker automatically searches the in-memory TSB for the TTE corresponding to the faulting virtual address. This is a hardware walk, not a software interrupt.
- TSB Hit: The hardware copies the TTE from the TSB into the TLB. The translation then completes, and the memory access retries (which will now TLB hit).
- TSB Miss: Proceed to Level 3.
Level 3: Kernel Hash Table Walk (Software)
- Action on TSB Miss: The hardware triggers an interrupt to the kernel.
- The kernel's page-fault handler now performs a software search in the appropriate hashed page table (user or kernel) to find the mapping.
- Once found, the kernel:
- Creates a TTE for the mapping.
- Stores this TTE into the TSB (so future misses on this page will TSB-hit).
- Possibly loads it directly into the TLB.
- The interrupt handler returns, the MMU retries the translation (which will now succeed via TLB or TSB), and the original instruction proceeds.
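The three-level miss chain above can be modeled with a short Python sketch. The dictionaries below are simplified stand-ins (not Solaris data structures), and the mapping `5 → 99` is a made-up example; the point is only the fill-on-miss flow from hash table to TSB to TLB.

```python
# Illustrative model of the TLB -> TSB -> hash-table miss chain.
tlb = {}              # tiny hardware cache: vpn -> frame
tsb = {}              # larger in-memory TTE cache: vpn -> frame
hash_table = {5: 99}  # authoritative mapping maintained by the kernel

def translate(vpn):
    if vpn in tlb:                  # Level 1: TLB hit (common fast path)
        return tlb[vpn]
    if vpn in tsb:                  # Level 2: TSB hit
        tlb[vpn] = tsb[vpn]         # copy the TTE into the TLB, retry
        return tlb[vpn]
    if vpn in hash_table:           # Level 3: kernel hash-table walk
        tsb[vpn] = hash_table[vpn]  # store the TTE into the TSB...
        tlb[vpn] = hash_table[vpn]  # ...and load it into the TLB
        return tlb[vpn]
    raise LookupError("page fault")

print(translate(5))  # first access walks the hash table -> 99
print(translate(5))  # second access is a TLB hit -> 99
```

Each level acts as a cache for the one below it, which is why the second access never leaves the TLB.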
4. Summary and Key Takeaways¶
This design exemplifies sophisticated, real-world memory management:
- Hybrid Structure: Uses hashed page tables (with region granularity) as the primary, space-efficient backing store.
- Multi-Level Caching: Employs a hierarchy (TLB → TSB → Hash Table) to optimize performance. The TSB is a critical intermediary that offloads many misses from costly kernel interrupts to a faster hardware-managed walk.
- Hardware/Software Co-Design: The hardware (MMU) knows about and can walk the software-defined TSB structure. The kernel manages the hash tables and handles the slowest misses. This tight integration is key to performance on 64-bit systems.
- Efficiency for Large Mappings: Using hash entries that cover large contiguous regions reduces table size and improves TSB/TLB effectiveness for workloads with big, contiguous memory areas.
This case shows how advanced OSes solve the page table scalability problem through a combination of intelligent data structures (hashed, region-based tables) and multi-tiered caching (TLB, TSB), with close cooperation between CPU hardware and operating system software.
Section 9.5: Swapping¶
1. The Basic Concept of Swapping¶
A core principle is that instructions/data must be in main memory (RAM) for the CPU to execute them. Swapping is a technique where the OS can temporarily move a process (or part of a process) out of RAM to a backing store (usually a disk) and later bring it back.
- Primary Goal: To allow the total memory requirements of all active processes to exceed the available physical RAM. This increases the degree of multiprogramming (number of processes seemingly running concurrently).
- Visualization: Go to Figure 9.19. This shows the classic view: two processes (P1, P2) and the OS in memory. One process (P2) can be swapped out to disk, freeing its memory for another use, and later swapped back in.
Section 9.5.1: Standard Swapping (Whole-Process Swapping)¶
How It Works¶
- The entire address space of a process (code, data, stack, heap) is copied as a single contiguous unit between physical memory and the backing store.
- The backing store must be fast (traditionally a dedicated disk partition or fast SSD) and support direct access to quickly read/write these large memory images.
- For multithreaded processes, all per-thread data (stacks, thread control blocks) must also be swapped.
- The OS maintains metadata (process state, location on disk, memory size) for swapped-out processes to restore them later.
Advantage and Use Case¶
- Advantage: Enables memory oversubscription. The system can run more processes than physically fit in RAM at once.
- Ideal Candidate: Idle or low-priority processes. Their memory can be reclaimed for active processes. When they become active again, they are swapped back in.
The Major Drawback: Performance¶
- The time to copy an entire process (which could be many megabytes) to/from disk is prohibitive. The process is completely frozen during this transfer, and system responsiveness suffers.
- Modern Status: Rarely used in contemporary general-purpose OSes due to this high cost. An exception is Solaris, which uses it only as a last-resort emergency measure under extreme memory pressure.
Section 9.5.2: Swapping with Paging (The Modern Standard)¶
The Evolution: From Processes to Pages¶
Because whole-process swapping is too slow, modern systems use a refined technique: Swapping with Paging, usually just called Paging (or Demand Paging, covered in Chapter 10).
- Unit of Transfer: Instead of moving an entire process, the OS moves individual pages (or small groups of pages) between memory and the backing store.
- Terminology:
- Page out: Move a page from RAM to the backing store.
- Page in: Move a page from the backing store back into a free frame in RAM.
- Visualization: Go to Figure 9.20. This illustrates the modern view: Processes A and B have only a subset of their pages in physical memory. Some pages of A are being paged-out, while some pages of B are being paged-in. Memory is a collection of frames holding pages from various processes.
Advantages Over Standard Swapping¶
- Granularity & Speed: Transferring a single 4KB page is vastly faster than transferring an entire 100MB process. This reduces the latency felt by individual processes.
- Efficiency: Only the actively used portions of a process need to reside in memory. Large, unused sections (like error handling code) may never be paged in.
- Flexibility: The OS can page out only the least-used pages from any process, not necessarily the entire least-used process. This leads to more efficient memory utilization.
- Foundation for Virtual Memory: This paging mechanism is the core implementation of virtual memory, which provides the illusion of an extremely large logical address space to each process, regardless of physical RAM size.
Relationship to Virtual Memory¶
- Swapping with paging is not virtual memory itself, but the essential mechanism that makes virtual memory possible.
- Chapter 10 (Virtual Memory) will dive deep into the policies (like demand paging and page replacement algorithms) that decide which pages to page in/out and when to do it, optimizing this process.
In summary, while "swapping" historically meant moving whole processes, in modern systems it refers to the **paging of individual pages to/from disk**. This is the critical mechanism that allows systems to efficiently oversubscribe physical memory, supporting more and larger applications than would otherwise be possible.
Section 9.5.3: Swapping on Mobile Systems¶
1. The Mobile Constraint: No Traditional Swapping¶
Unlike PC/server OSes, mobile operating systems (iOS, Android) typically do NOT support swapping or paging to storage. This is a fundamental design difference due to hardware constraints.
Primary Reasons:
- Storage Medium: Mobile devices use flash memory (eMMC, UFS, NVMe) for storage, not hard disks. While spacious, it has critical limitations.
- Write Endurance: Flash memory cells wear out after a finite number of write/erase cycles. Frequent paging operations (writing dirty pages to storage) would severely shorten the device's lifespan.
- Throughput & Latency: The I/O path between main memory (RAM) and flash storage on mobile SoCs (System-on-a-Chip) is often slower relative to their RAM speed than on PCs/servers, making swapping performance even more punitive.
- Power Consumption: Frequent flash writes consume significant battery power.
2. Mobile OS Strategies for Memory Pressure¶
Since they cannot swap, mobile OSes use alternative, more aggressive methods to free memory when it runs low.
Apple's iOS Approach:¶
- Memory Pressure Notifications: When free memory is critically low, the OS notifies running applications and asks them to voluntarily free up memory (e.g., purge caches, release unused objects).
- Handling Read-only Data: The OS can unload read-only data (like code pages) from RAM. Since the original is unchanged on flash, it can be reloaded later if needed.
- Never Removing Dirty Data: Modified data (stack, heap) is never written to flash due to wear concerns.
- The Ultimate Sanction: If an app fails to free enough memory, the OS terminates it.
Android's Approach:¶
- Process Termination (Low Memory Killer): Android maintains a hierarchy of process importance (foreground app, visible service, background app, etc.). Under memory pressure, it starts terminating processes from least important to most important.
- Application State Preservation: Before terminating a process, Android writes the application's state (UI state, variables) to flash. This allows the app to be quickly restarted to its previous state when the user returns to it, improving the user experience despite the termination.
- Developer Responsibility: The onSaveInstanceState() callback in Android is part of this mechanism.
Consequence for Mobile Developers:¶
Developers must write memory-efficient code. Memory leaks (failure to release unused memory) or excessive memory use are much more critical on mobile, as they directly lead to app termination (iOS) or being killed first (Android), degrading user experience.
3. Figure Reference: Swapping with Paging¶
Go to Figure 9.20 in your book. While this figure illustrates the swapping with paging mechanism used on desktops/servers, it highlights what mobile systems do not do. They avoid the constant page-out/page-in cycle between RAM and flash storage shown here.
System Performance Under Swapping (General Note)¶
The presence of any swapping activity (page-outs) is a warning sign that the system has more active processes/pages than available physical RAM. This leads to constant disk I/O (thrashing), severely degrading performance.
The two straightforward solutions are:
- Reduce Demand: Terminate some processes (the mobile approach).
- Increase Supply: Add more physical RAM (the PC/server approach).
This highlights that swapping/paging is a useful illusion for oversubscription, but it is not a substitute for having adequate physical memory. When swapping becomes frequent, performance collapses.
Section 9.6: Example: Intel 32- and 64-bit Architectures¶
1. Introduction: The Dominant PC Architecture¶
Intel's x86 architecture has been the foundation of personal computing for decades. Its evolution provides a concrete, real-world case study of memory management hardware.
- Historical Progression: 16-bit (8086/8088) → 32-bit (IA-32, Pentium family) → 64-bit (x86-64).
- Ubiquity: Runs major OSes: Windows, macOS, Linux.
- Mobile Exception: Intel is not dominant in mobile, where ARM architecture prevails (see Section 9.7).
- Scope of Discussion: We'll cover the major memory-management concepts of IA-32 and x86-64. Note that there are many variations and versions across CPU generations.
Section 9.6.1: IA-32 Architecture (32-bit)¶
1. The Two-Stage Address Translation¶
The IA-32 memory management is a hybrid scheme combining Segmentation and Paging. Both are always used together in protected mode (the standard operating mode for modern OSes).
The translation pipeline has two distinct hardware units:
- Segmentation Unit
- Paging Unit
Together, they form the complete Memory Management Unit (MMU).
2. Translation Pipeline¶
Logical Address → [Segmentation Unit] → Linear Address → [Paging Unit] → Physical Address
Step 1: Logical to Linear Address (Segmentation)
- The CPU generates a logical address. This is a pair: a Segment Selector (identifying a segment) and an Offset within that segment.
- The Segmentation Unit uses the selector to look up a Segment Descriptor in a table (GDT or LDT). This descriptor contains the segment's base linear address, limit, and protection bits.
- The unit adds the offset to the segment base to produce a linear address. It also checks the offset against the segment limit for protection.
Step 2: Linear to Physical Address (Paging)
- The linear address is a 32-bit value that is the input to the paging hardware.
- The Paging Unit takes this linear address and, using the page table structures (typically a two-level page table), translates it into a physical address.
3. Visual Reference¶
Go to Figure 9.21 in your book. This figure illustrates the complete two-stage translation process described above, showing the flow from the logical address (selector:offset) through the segmentation and paging units to the final physical address.
4. Key Implications for the OS¶
- OS Design Choice: While the hardware supports full segmentation, most modern OSes (like Windows and Linux) minimize its use to simplify memory management. They often set up segments with a base of 0 and a limit of 4GB, effectively making the logical address equal to the linear address (offset = linear address). This effectively bypasses segmentation for flat memory models.
- Paging is Primary: The paging unit does the heavy lifting of modern memory management (virtual memory, protection, sharing). The segmentation stage is largely a historical artifact that must be configured but is often neutralized.
- Protection from Both Stages: Protection can be enforced at both stages: segment limits and permissions in the segmentation unit, and page-level read/write/execute bits in the paging unit.
In summary, IA-32 uses a mandatory two-stage translation (Segmentation + Paging), but operating systems typically configure it to implement a flat address space where paging provides all the important features of virtual memory.
Section 9.6.1.1: IA-32 Segmentation¶
1. Segmentation Model Overview¶
The IA-32 segmentation model provides each process with a two-part logical address space composed of segments.
- Segment Capacity: A single segment can be up to 4 GB in size.
- Total Segments per Process: Up to 16,384 (16K) segments.
- Two Partitions:
- Private Segments (Up to 8K): Belong only to the process. Descriptors stored in the Local Descriptor Table (LDT).
- Shared Segments (Up to 8K): Shared among all processes. Descriptors stored in the Global Descriptor Table (GDT). The OS kernel and shared libraries typically reside here.
- Descriptor Tables (LDT & GDT): These are arrays of 8-byte Segment Descriptors stored in memory. Each descriptor fully defines one segment: its 32-bit base linear address, its limit (size), and its protection/type attributes.
2. The Logical Address Format: (Selector, Offset)¶
A logical address is not a single number but a pair:
- Selector (16 bits): An index into a descriptor table. Its format is:
| Index (13 bits) | Table Indicator (1 bit) | Requested Privilege Level (2 bits) |
- Index (bits 3-15): The segment number (0 to 8191). This selects one of the 8K entries in the LDT or GDT.
- Table Indicator - TI (bit 2): 0 = Use the GDT. 1 = Use the LDT.
- Requested Privilege Level - RPL (bits 0-1): The privilege level requested for this access, checked against the descriptor's privilege level for protection.
- Offset (32 bits): The byte address within the selected segment.
Visual Reference: Go to Figure 9.22. This shows the translation from the (selector, offset) pair to a linear address using the descriptor tables.
3. Hardware Support: Segment Registers & Cache¶
To speed up segmentation, the CPU provides dedicated hardware:
- Six Segment Registers: CS (Code), DS (Data), SS (Stack), ES, FS, GS. These hold selectors for the six segments that can be actively used without a table lookup at any given moment.
- Six Microprogram Registers (Descriptor Caches): Hidden registers that automatically cache the 8-byte descriptors corresponding to the selectors in the segment registers.
- Performance Benefit: When a selector is loaded into a segment register, the CPU automatically fetches the full descriptor from the LDT/GDT and stores it in the hidden cache. Subsequent memory accesses via that segment register use the cached descriptor, avoiding slow memory reads for every translation.
4. Linear Address Generation Steps¶
For a logical address (selector, offset):
Selector Validation & Descriptor Fetch:
- The CPU uses the TI bit to choose either the GDT or LDT.
- It uses the Index from the selector to locate the Segment Descriptor in the chosen table.
- This descriptor is loaded into the corresponding hidden cache register (if not already cached).
Protection & Limit Check (Crucial):
- The CPU compares the 32-bit offset against the segment limit from the descriptor.
- If offset > limit, a segmentation fault (protection exception) is triggered, trapping to the OS.
- It also checks the access rights in the descriptor against the type of access (read/write/execute) and the CPU's current privilege level.
Linear Address Calculation:
- If checks pass, the CPU takes the 32-bit base linear address from the segment descriptor.
- It adds the 32-bit offset to this base.
- The result is the 32-bit linear address.
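The limit check and base addition reduce to a few lines. A sketch in Python, with the descriptor fields passed as plain parameters rather than decoded from a real 8-byte GDT entry:

```python
# Sketch of the segmentation unit: limit check, then base + offset
# (descriptor fields are illustrative, not the real GDT bit layout).
def to_linear(base, limit, offset):
    if offset > limit:
        raise MemoryError("segmentation fault: offset exceeds limit")
    return (base + offset) & 0xFFFFFFFF  # 32-bit linear address

# Flat model used by most OSes: base 0, limit 4 GB - 1, so the
# linear address simply equals the offset.
print(hex(to_linear(base=0, limit=0xFFFFFFFF, offset=0x00401000)))  # 0x401000
```

The flat-model call shows concretely why modern OSes can "neutralize" segmentation: with base 0 and a 4 GB limit, the stage becomes an identity function plus a check that never fails.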
This linear address is then passed to the **paging unit** to be translated into a physical address (discussed in the next section). This two-stage process (segmentation then paging) is the hallmark of IA-32 memory management.
Section 9.6.1.2: IA-32 Paging¶
1. Page Sizes and Two-Level Paging (Standard)¶
The IA-32 paging unit supports two page sizes:
- 4 KB (Standard)
- 4 MB (Large Pages)
For the standard 4 KB pages, IA-32 uses a two-level hierarchical page table, identical to the forward-mapped scheme described in Section 9.4.1.
Linear Address Split for 4KB Pages:
| Directory Index (p1) - 10 bits | Table Index (p2) - 10 bits | Offset (d) - 12 bits |
Visual Reference: Go to Figure 9.23. The left side of this figure shows the two-level scheme for 4KB pages.
Translation Process for 4KB Pages:
- CR3 Register: Contains the physical address of the Page Directory for the current process (loaded by the OS on a context switch).
- First Level - Page Directory: Use the high 10 bits (p1) of the linear address as an index into the Page Directory. The entry found is a Page Directory Entry (PDE).
- Second Level - Page Table: The PDE contains the physical address of a Page Table. Use the next 10 bits (p2) as an index into this Page Table. The entry found is a Page Table Entry (PTE).
- Frame & Offset: The PTE contains the physical frame number. Combine this with the low 12-bit offset (d) to form the final physical address.
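The 10/10/12 split is just bit shifting and masking, which a short Python helper makes concrete:

```python
# Splitting a 32-bit IA-32 linear address for 4 KB pages: 10 | 10 | 12 bits.
def split_ia32(linear):
    p1 = (linear >> 22) & 0x3FF  # directory index (top 10 bits)
    p2 = (linear >> 12) & 0x3FF  # page-table index (next 10 bits)
    d  = linear & 0xFFF          # offset within the page (low 12 bits)
    return p1, p2, d

print(split_ia32(0xFFC01ABC))  # (1023, 1, 2748)
```

So the linear address 0xFFC01ABC selects the last directory entry (1023), the second entry (1) in that page table, and byte 2748 within the frame.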
2. 4 MB Large Pages (An Optimization)¶
To reduce TLB pressure and page table depth for large, contiguous regions, IA-32 supports 4 MB pages.
- Mechanism: A Page Size (PS) flag in the Page Directory Entry (PDE). If PS = 1, the PDE points directly to a 4 MB physical page frame, bypassing the inner page table entirely.
- Address Split for 4MB Pages: The linear address is divided differently:
- Directory Index (p1): 10 bits (index into Page Directory).
- Offset (d): 22 bits (to address any byte within the 4 MB frame).
- Translation: The PDE provides the high 10 bits of the 32-bit physical frame address; the low 22 bits come from the linear address offset. This is shown on the right side of Figure 9.23.
3. Swapping Page Tables to Disk¶
The page tables themselves can be large. To save physical memory, IA-32 allows page tables (and even the page directory) to be paged out to disk.
- Invalid Bit: The present bit in a PDE or PTE indicates whether the referenced page table or page is in memory (present = 1) or on disk (present = 0).
- OS Use of Entry: If present = 0, the remaining 31 bits of the entry are not used by hardware. The operating system is free to use them to store the disk location of the swapped-out page table. This is a classic example of hardware/software cooperation.
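A toy encoding makes this hardware/software sharing of an entry tangible. The choice of bit 0 as the present bit matches IA-32, but the one-field "frame vs. disk slot" layout here is a simplification for illustration:

```python
# Sketch: one page-table entry, interpreted differently depending on
# the present bit (bit 0). If present = 0, hardware ignores the rest,
# so the OS reuses it to record the page's disk location.
PRESENT = 0x1

def make_present(frame):      # entry holds a physical frame number
    return (frame << 1) | PRESENT

def make_swapped(disk_slot):  # entry holds an OS-defined disk slot
    return disk_slot << 1     # present bit = 0

def decode(entry):
    if entry & PRESENT:
        return ("in memory, frame", entry >> 1)
    return ("on disk, slot", entry >> 1)

print(decode(make_present(42)))  # ('in memory, frame', 42)
print(decode(make_swapped(7)))   # ('on disk, slot', 7)
```

The same 32 bits thus carry two meanings, and only software ever looks at the "on disk" interpretation.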
4. Overcoming 32-bit Limits: Page Address Extension (PAE)¶
The 32-bit address bus limited physical memory to 4 GB. To support more RAM (e.g., in servers), Intel introduced PAE (Page Address Extension).
- What it does: Increases the physical address space from 32 to 36 bits, supporting up to 64 GB of RAM. The linear address remains 32 bits (processes still have a 4 GB virtual address space).
- Architectural Change: Paging changes from two-level to three-level.
- New Structure: Page Directory Pointer Table (PDPT) → Page Directory → Page Table → Page.
- Entry Size: PDEs and PTEs expand from 32 to 64 bits to hold the larger 24-bit base addresses (instead of 20-bit) needed for 36-bit physical addressing.
- Visual Reference: Go to Figure 9.24. This illustrates the three-level PAE scheme with 4KB pages. The CR3 register now points to a PDPT.
Linear Address Split for PAE with 4KB Pages:
| PDPT Index (2 bits) | Directory Index (9 bits) | Table Index (9 bits) | Offset (12 bits) |
(Note: Sizes shift due to 64-bit entries and alignment).
- OS Support Required: The operating system must be explicitly written to use PAE. Linux and macOS support it. However, 32-bit desktop Windows versions typically artificially limit usable RAM to 4 GB for driver compatibility reasons, even if PAE is enabled in the CPU.
Section 9.6.2: x86-64 Architecture¶
1. Historical Context: AMD Leads the Way¶
Intel's path to 64-bit computing was not straightforward.
- Intel's First Attempt (IA-64/Itanium): A completely new, non-x86 architecture. It failed to gain widespread adoption.
- AMD's Innovation (x86-64): AMD designed a 64-bit extension of the existing IA-32 (x86) instruction set. This ensured backward compatibility with the vast library of existing 32-bit software, which was key to its success.
- Role Reversal: Intel, seeing AMD's success, ultimately adopted AMD's x86-64 architecture (marketing it as Intel 64, EM64T). This is the foundation of all modern 64-bit Intel and AMD PC/server processors.
2. The 64-bit Address Space: Theory vs. Practice¶
- Theoretical Maximum: A full 64-bit address space is 2^64 bytes = 16 Exabytes. This is astronomically large.
- Practical Implementation: Current x86-64 CPUs do not implement all 64 bits in hardware. It would make address translation tables (page tables) impossibly large and complex.
- Actual Virtual Address Size: 48 bits. This provides a virtual address space of 256 Terabytes (2^48 bytes) per process, which is still massive.
- Canonical Address Form: The upper 16 bits (bits 48-63) of a 64-bit virtual address must be a sign extension of bit 47 (all 0s or all 1s). This is a hardware requirement to allow future expansion to more bits while maintaining compatibility. This is why Figure 9.25 shows bits 48-63 as "unused" in the current design.
3. Paging Hierarchy: Four-Level Paging¶
To manage the 48-bit virtual address space efficiently, x86-64 uses a four-level hierarchical page table. This is a direct extension of the two/three-level schemes seen in IA-32.
Linear Address Split (48-bit effective, 64-bit format): As shown in Figure 9.25, the bits are divided among the four levels and the offset. For a standard 4 KB page:
- Bits 47-39 (9 bits): Index into the Page Map Level 4 (PML4) table.
- Bits 38-30 (9 bits): Index into the Page Directory Pointer Table (PDPT).
- Bits 29-21 (9 bits): Index into the Page Directory.
- Bits 20-12 (9 bits): Index into the Page Table.
- Bits 11-0 (12 bits): Byte Offset within the 4 KB page.
Translation Walk: The CR3 register points to the physical base of the PML4 table. The MMU walks through each level using the indices, finally retrieving the physical frame number from the PTE.
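The 9/9/9/9/12 split can be checked with the same shift-and-mask idiom as in the IA-32 case:

```python
# Extracting the four table indices and offset from a 48-bit x86-64
# virtual address with 4 KB pages: 9 | 9 | 9 | 9 | 12 bits.
def split_x86_64(va):
    return {
        "pml4":   (va >> 39) & 0x1FF,
        "pdpt":   (va >> 30) & 0x1FF,
        "pd":     (va >> 21) & 0x1FF,
        "pt":     (va >> 12) & 0x1FF,
        "offset": va & 0xFFF,
    }

# Highest canonical user-space page on a 48-bit system:
print(split_x86_64(0x0000_7FFF_FFFF_F000))
# {'pml4': 255, 'pdpt': 511, 'pd': 511, 'pt': 511, 'offset': 0}
```

Each 9-bit index addresses one of 512 eight-byte entries, so every level's table is exactly one 4 KB page, which is why the scheme nests so cleanly.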
4. Page Sizes and Physical Address Support¶
- Supported Page Sizes: 4 KB, 2 MB, and 1 GB. Larger pages reduce TLB misses for big, contiguous memory regions.
- Physical Address Space: Extending the PAE mechanism, x86-64 supports up to 52-bit physical addresses (2^52 bytes = 4 Petabytes of RAM), far beyond what current systems install, providing ample headroom. The 64-bit entries at each level have room for the frame bits of a 52-bit physical address.
In summary, x86-64 is a 64-bit evolution of IA-32 that uses a 48-bit virtual address with a four-level page table hierarchy. It maintains backward compatibility while providing an enormous virtual and physical address space, using sophisticated paging with multiple page sizes to manage it efficiently.
Chapter 9.7: Example: ARMv8 Architecture¶
1. Introduction: The Mobile and Embedded Giant¶
While Intel dominates PCs/servers, ARM dominates the mobile and embedded world (smartphones, tablets, IoT). Its business model is unique: ARM designs the CPU architecture and licenses the designs to other companies (like Apple, Qualcomm, Samsung) who manufacture the chips.
- Ubiquity: Over 100 billion ARM processors produced, making it the most produced CPU architecture ever.
- Focus: We examine the 64-bit ARMv8 architecture (used in modern smartphones and servers).
2. Flexibility: Translation Granules¶
A key design feature of ARMv8 is its flexibility in the fundamental unit of translation, called the Translation Granule. The OS can choose one of three sizes:
| Translation Granule | Page Size(s) | Region Size(s) |
|---|---|---|
| 4 KB | 4 KB | 2 MB, 1 GB |
| 16 KB | 16 KB | 32 MB |
| 64 KB | 64 KB | 512 MB |
- Granule Size: Defines the size of the smallest unit (page) and influences the structure of the page tables.
- Region Sizes: These are large contiguous blocks of memory that can be mapped by a single entry in an upper-level page table, bypassing deeper levels. This is a major optimization for mapping big chunks of memory (like the kernel or a large library) efficiently.
3. Address Translation for 4 KB Granule (The Common Case)¶
Go to Figure 9.26. This shows the 48-bit virtual address split for the 4 KB translation granule (note the unused high 16 bits, similar to x86-64). The address is divided into indices for up to four levels of page tables.
- Bits 47-39 (9 bits): Level 0 Index
- Bits 38-30 (9 bits): Level 1 Index
- Bits 29-21 (9 bits): Level 2 Index
- Bits 20-12 (9 bits): Level 3 Index
- Bits 11-0 (12 bits): Offset within a 4 KB Page.
Go to Figure 9.27. This illustrates the four-level hierarchical paging structure. The Translation Table Base Register (TTBR) points to the Level 0 table for the current thread/process.
4. The Power of Regions: Flexible Hierarchical Mapping¶
This is where ARMv8 gets clever. Entries in the Level 1 and Level 2 tables are not required to point to a next-level table. They can instead point directly to a Region.
- Level 1 Entry → 1 GB Region: If a Level 1 entry is configured as a region descriptor, it maps a contiguous 1 GB block of virtual memory directly to physical memory. The translation stops here. The low 30 bits of the virtual address become the offset within this 1 GB region.
- Level 2 Entry → 2 MB Region: Similarly, a Level 2 entry can be a region descriptor for a 2 MB block. The low 21 bits become the offset.
- Benefit: This drastically reduces page table size and TLB usage for large, contiguous memory areas. The OS kernel or a large app can be mapped with a handful of region entries instead of thousands of page entries.
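This short-circuit translation through a region descriptor can be sketched as follows. The entry format here (a dict with `is_region` and `base` fields) is a made-up simplification, not ARM's actual descriptor layout:

```python
# Toy model of a Level 1 entry that is either a pointer to a Level 2 table
# or a 1 GB region (block) descriptor that ends the walk immediately.

REGION_1GB = 1 << 30

def translate_level1(entry, vaddr):
    """Return a physical address if the entry is a region, else None."""
    if entry["is_region"]:
        # Translation stops here: the low 30 bits of the virtual address
        # become the offset within the 1 GB region.
        return entry["base"] + (vaddr & (REGION_1GB - 1))
    # Otherwise the walk would descend into the Level 2 table (not shown).
    return None
```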
5. TLB Hierarchy: Micro TLBs and Main TLB¶
ARMv8 uses a two-level TLB hierarchy for speed:
- Inner Level: Micro TLBs (µTLB):
- Very small, extremely fast TLBs that are split into a separate Instruction µTLB and Data µTLB.
- They support ASIDs for process-specific tags.
- The first lookup happens here, in parallel for instruction fetch and data access.
- Outer Level: Main TLB:
- A larger, unified TLB.
- On a µTLB miss, the hardware checks the Main TLB.
- Page Table Walk:
- If both TLBs miss, the hardware MMU performs an automatic page table walk in memory (using the TTBR and the multi-level structure).
- The resulting translation is loaded into the TLBs.
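The lookup order above can be modeled with plain dictionaries. The structures and refill policy in this sketch are illustrative assumptions, not ARM's actual hardware behavior:

```python
# Toy model of a two-level TLB: micro-TLB first, then the main TLB,
# then a page-table walk. Entries are tagged by (ASID, virtual page number).

def lookup(vpn, asid, micro_tlb, main_tlb, page_table):
    key = (asid, vpn)
    if key in micro_tlb:                 # inner level: fastest path
        return micro_tlb[key]
    if key in main_tlb:                  # outer level: unified main TLB
        micro_tlb[key] = main_tlb[key]   # refill the micro-TLB
        return main_tlb[key]
    frame = page_table[vpn]              # both missed: hardware table walk
    main_tlb[key] = frame                # load the translation into both TLBs
    micro_tlb[key] = frame
    return frame
```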
6. Looking Ahead: The Need for 64-bit¶
The sidebar on 64-bit Computing poses a philosophical question: Will we ever need the full 64-bit address space (16 exabytes)?
Consider future demands:
- In-memory Databases: Entire corporate or even national datasets kept in RAM.
- Advanced Simulation & AI: Modeling complex systems (climate, proteins, the brain) at high fidelity requires massive memory.
- Extreme Multimedia: Ubiquitous high-fidelity 3D, holographic, or volumetric video.
- Memory-centric Computing: Architectures where memory is the central resource, not the CPU.
History suggests that software and data expand to consume available resources. While 48 bits (256 TB) is vast today, the jump to 64 bits provides an essentially limitless headroom for future, unimaginable applications, ensuring the architecture's longevity.
Section 9.8: Summary¶
1. The Central Role of Memory¶
Memory (RAM) is the central storage that the CPU directly accesses to fetch instructions and data. It is organized as an array of addressable bytes.
2. Basic Hardware Protection: Base and Limit Registers¶
A fundamental hardware mechanism for memory protection and relocation uses two registers:
- Base Register: Holds the starting physical address of a process.
- Limit Register: Holds the size of the process's address space.
Together, they define a protected, contiguous memory region: every address is checked against the limit and, if legal, added to the base by the MMU.
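The check-then-relocate step can be sketched in a few lines (the function name is hypothetical; a real MMU performs this in hardware on every reference):

```python
# Minimal sketch of the base/limit check an MMU performs on each access.

def mmu_translate(logical, base, limit):
    """Relocate a logical address, trapping if it exceeds the limit."""
    if logical >= limit:                  # outside the process's region
        raise MemoryError("trap: addressing error")
    return base + logical                 # relocate into physical memory
```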
3. Address Binding Timing¶
The mapping of symbolic addresses to physical addresses can occur at different times, affecting flexibility:
- Compile Time: Generates absolute code. Inflexible.
- Load Time: Generates relocatable code. Binding occurs on load.
- Execution Time: Binding delayed until runtime. Requires hardware (MMU) support. Used by all modern systems.
4. Logical vs. Physical Addresses¶
- Logical (Virtual) Address: Generated by the CPU. The process's view of memory.
- Physical Address: The actual address on the memory bus. The Memory Management Unit (MMU) translates logical addresses to physical addresses.
5. Contiguous Memory Allocation & Its Problems¶
Early systems allocated contiguous partitions to processes. Placement strategies included:
- First-Fit: Fast, allocates the first suitable hole.
- Best-Fit: Minimizes leftover hole size, but creates many tiny fragments.
- Worst-Fit: Maximizes the leftover hole; generally performs poorly.
All three strategies suffer from external fragmentation (free memory broken into small, noncontiguous holes) and may require compaction.
6. Paging: The Fundamental Solution¶
Paging eliminates external fragmentation by allowing a process's physical memory to be noncontiguous.
- Physical memory divided into frames.
- Logical memory divided into pages of the same size.
- The page table (one per process) maps page numbers to frame numbers.
- The logical address is split into (Page Number p, Page Offset d).
- The page number indexes the page table; the retrieved frame number is combined with the offset to form the physical address.
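This split-and-combine translation is a one-liner in code. A minimal sketch, assuming 4 KB pages and a plain dict as the page table:

```python
# Basic paging translation: split the logical address, look up the frame,
# and reattach the unchanged offset.

PAGE_SIZE = 4096

def paging_translate(logical, page_table):
    p = logical // PAGE_SIZE        # page number indexes the page table
    d = logical % PAGE_SIZE         # offset passes through unchanged
    frame = page_table[p]
    return frame * PAGE_SIZE + d
```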
7. The Translation Lookaside Buffer (TLB)¶
A hardware cache for page table entries to avoid the performance penalty of a memory access for every translation.
- TLB Hit: Translation is immediate (fast path).
- TLB Miss: Requires a full page table walk (slow). The new mapping is then loaded into the TLB.
- Address-Space Identifiers (ASIDs) allow the TLB to hold entries for multiple processes without flushing on a context switch.
8. Structuring Page Tables for Large Address Spaces¶
Simple linear page tables are too large for 32/64-bit spaces. Solutions include:
- Hierarchical (Multi-Level) Paging: The page table itself is paged. Standard for 32-bit (2 levels) and 64-bit (3-4 levels) systems.
- Hashed Page Tables: Uses a hash table to map virtual page numbers to entries. Good for sparse address spaces.
- Inverted Page Tables: One system-wide table indexed by physical frame number. Saves space but complicates sharing and requires hashing for performance.
9. Swapping¶
The technique of moving process data between memory and backing store (disk) to allow memory oversubscription.
- Standard Swapping: Moves entire processes. Too slow, rarely used.
- Swapping with Paging (Modern): Moves individual pages. The core mechanism of virtual memory.
- Mobile Systems: Typically do not swap due to flash memory wear and performance constraints. They use app termination and state preservation instead.
10. Real-World Architectures¶
- Intel IA-32 (32-bit): Combines segmentation (often effectively bypassed via a flat segment model) with two-level paging (plus PAE to address more than 4 GB of RAM).
- Intel/AMD x86-64 (64-bit): Uses 48-bit virtual addresses with a four-level page table hierarchy. Supports 4 KB, 2 MB, and 1 GB pages.
- ARMv8 (64-bit): Highly flexible with translation granules (4 KB, 16 KB, 64 KB) and region mappings. Uses a two-level TLB (micro TLBs and main TLB) and up to four levels of paging.
Key Takeaway¶
Modern memory management is built on paging, which provides the illusion of a large, private, contiguous address space for each process while efficiently using and protecting physical memory through hardware-assisted translation (MMU, TLB) and sophisticated OS data structures (page tables). This foundation enables virtual memory, explored in the next chapter.
Chapter 10: Virtual Memory¶
10.1 Background¶
The memory-management algorithms we studied in Chapter 9 (like paging and segmentation) exist because of a fundamental hardware requirement: instructions must be in physical memory (RAM) to be executed by the CPU. The simplest way to satisfy this requirement is to load the entire logical address space of a program into physical memory. Techniques like dynamic loading can ease this somewhat, but they typically require special effort from the programmer.
While the requirement that code must be in RAM to run seems obvious, it creates a major limitation: it restricts the maximum size of a program to the size of the physical memory available. If you only have 4GB of RAM, you cannot run a single program that needs 5GB, even if parts of it aren't used simultaneously.
Key Insight: Programs are rarely fully utilized.
In practice, we don't need the entire program in memory at all times. Consider these common scenarios:
- Error Handling Code: Programs often contain routines to handle rare errors (like disk-full or network failure). This code is crucial but almost never executed.
- Over-allocated Data Structures: Programmers frequently declare arrays, lists, or tables larger than they ever use (e.g., a 1000x1000 array that never holds more than 10x10 items). This reserved memory sits idle.
- Infrequently Used Features: Many programs have components used only occasionally (e.g., a "balance the federal budget" routine in government software that hasn't been run in years).
- Temporal Locality: Even for essential parts, a program doesn't need everything at the same time. It executes in phases, using different subsets of its code and data at different times.
The Benefits of Partial Loading¶
If we could execute a program with only part of it in physical memory, we'd gain significant advantages:
- Program Size Freedom: A program is no longer limited by physical RAM size. Programmers can design for a huge virtual address space (e.g., 64-bit addresses), making development simpler.
- Higher System Throughput: Since each running program consumes less physical memory, the OS can keep more programs in memory simultaneously. This leads to better CPU utilization and higher overall system throughput (more work done per second) without slowing down individual programs.
- Faster Program Startup and Execution: Loading only the essential parts of a program into memory requires less I/O (disk reading) initially. A program can begin execution faster, and as it runs, only the needed pieces are loaded, improving perceived performance.
In summary, running partially-loaded programs benefits both the system (efficiency, throughput) and the user (larger programs, faster response).
The Virtual Memory Abstraction¶
Virtual memory achieves this by decoupling the programmer's logical view of memory from the actual physical hardware. It provides the illusion of a very large logical memory space while using a much smaller amount of physical RAM.
Go to Figure 10.1. This diagram is central: it shows a large virtual memory space (on the left) being mapped onto a smaller physical memory via a memory map. Some parts ("pages") of the virtual memory reside in physical RAM frames, while others reside on the backing store (disk). The OS and hardware manage this mapping dynamically and transparently.
The Virtual Address Space¶
The virtual address space is the logical view of how a process is stored in memory. To the process, it appears as a single, contiguous space starting at address 0.
Go to Figure 10.2. This shows the typical layout of a process's virtual address space in a Unix/Linux system:
- Text: The program's executable code (read-only).
- Data: The global and static variables.
- Heap: Dynamically allocated memory (via `malloc()` or `new`). It grows upward (toward higher addresses) as more memory is requested.
- Stack: Used for function call management, local variables, and return addresses. It grows downward (toward lower addresses) with each function call.
The large blank space between the heap and stack is unused virtual memory. It is part of the address space but consumes no physical resources unless the heap or stack grows into it. Address spaces with these unused gaps are called sparse address spaces. This sparsity is efficient and allows for flexible growth of segments and dynamic linking of libraries during execution.
Sharing and Communication Enabled by Virtual Memory¶
Beyond just enlarging the apparent memory, virtual memory enables powerful sharing features through page sharing:
- Sharing System Libraries: Common libraries (like the standard C library `libc`) can be loaded into physical memory once. The OS then maps the same physical pages into the virtual address space of every process that uses the library. Go to Figure 10.3. It illustrates this beautifully: two separate processes each have their own private `text`, `data`, `heap`, and `stack`, yet the pages of the shared library map to the same physical page frames. This saves tremendous amounts of memory.
- Shared Memory for Inter-Process Communication (IPC): Processes can create regions of memory intended explicitly for sharing. Each process maps the region into its own virtual address space, but behind the scenes the OS points their mappings at the same set of physical pages. This provides a very fast method for processes to communicate (as discussed in Chapter 3).
- Fast Process Creation (`fork()`): When a process creates a child using `fork()`, the OS can use a technique called copy-on-write. Instead of immediately duplicating all parent pages, it maps the child's virtual pages to the same physical pages as the parent and marks them read-only. Only if either process tries to modify a page is a separate copy created. This makes `fork()` extremely fast and memory-efficient.
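The copy-on-write idea can be modeled in miniature. The page-table and frame representations below are toy stand-ins, not a real kernel's data structures:

```python
# Toy copy-on-write model: after fork(), parent and child share frames
# (marked read-only) until one of them writes, which triggers a private copy.

def fork_cow(parent_pt):
    """Create a child page table sharing the parent's frames, all read-only."""
    child_pt = {vpn: {"frame": e["frame"], "ro": True}
                for vpn, e in parent_pt.items()}
    for e in parent_pt.values():
        e["ro"] = True
    return child_pt

def write(pt, vpn, value, frames):
    """Write through a page table; copy the frame first if it is shared."""
    entry = pt[vpn]
    if entry["ro"]:                           # write fault on a shared page
        private = frames[entry["frame"]].copy()  # duplicate the frame now
        frames.append(private)
        entry["frame"] = len(frames) - 1      # remap to the private copy
        entry["ro"] = False
    frames[entry["frame"]][0] = value
```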
We will now explore the primary mechanism that makes all this possible: demand paging.
10.2 Demand Paging¶
Core Concept: Loading on Need, Not All at Once¶
Imagine you have a massive encyclopedia program. At the start, it shows you a main menu with chapters like "Animals," "History," "Chemistry." The old way would be to load the entire encyclopedia—every single chapter—into your computer's physical RAM as soon as you double-click the icon, even though you might only ever read the "Animals" section.
- Problem with the old way: It's wasteful. It uses up precious, fast physical memory (RAM) for code and data that the program might never actually need during its current run. This slows down other programs and limits how many you can run at once.
What is Demand Paging?¶
Demand paging is the smarter, more efficient alternative. It's a fundamental technique used by all modern virtual memory systems.
- Definition: It is a strategy where pages of a program (chunks of its code and data) are loaded into physical memory only when they are actually needed (or "demanded") during execution.
- How it works: The program initially resides on the slower secondary storage (like your HDD or SSD). The operating system loads just the first page or two to get the program started (e.g., to show the main menu).
- The "Demand" Trigger: A page is "demanded" when the CPU tries to access an address that belongs to a page currently not in physical RAM. This access attempt causes a special interrupt called a page fault.
Key Analogy and Connection¶
Think of it as a paging system combined with swapping, but finer-grained and lazy: nothing is loaded until it is actually needed. (Refer to Section 9.5.2 for background on swapping.)
- In swapping, an entire process is moved between memory and disk.
- In demand paging, we swap individual pages in and out, and only do it when the program explicitly needs that specific page.
The Primary Benefit¶
This is a major reason why virtual memory is so powerful:
- Efficient Memory Use: Physical RAM is filled only with the pages that are actively being used. Pages that are never accessed (like the code for all the unselected menu options in the book's example) are never loaded. This allows the system to run more programs concurrently and makes much better use of available RAM.
Crucial Terminology Recap¶
- Page: A fixed-size block of a program's address space (e.g., 4 KB).
- Frame: A same-sized block of physical memory where a page can be loaded.
- Page Fault: The interrupt that occurs when a needed page is not in memory. This is the central event that triggers the demand-paging mechanism.
- Secondary Storage (Backing Store): Where all pages live when not in RAM (HDD, SSD, NVM device). Also called the swap space or page file.
10.2.1 Basic Concepts¶
The Two Homes for Pages¶
In a demand-paging system, a process's pages live in one of two places at any given moment:
- In Memory (Physical RAM): Actively being used or recently used.
- On Secondary Storage (Backing Store / Swap Space): The "waiting room" for all other pages.
Because of this split, the hardware and operating system need a clear way to know where a specific page currently is.
Hardware Support: The Valid-Invalid Bit (Revisited)¶
We use the same valid-invalid bit scheme from the page table (as discussed in Section 9.3.3), but with an expanded meaning.
Bit Set to "Valid" (e.g., 'v'): This has the standard meaning. The associated page is:
- Legal (part of the process's logical address space).
- Loaded in a frame in physical memory. The page table entry contains the real physical frame number.
Bit Set to "Invalid" (e.g., 'i'): This now has two possible meanings:
- The page is not valid (not part of the process's address space). Accessing it is a programming error.
- The page is valid, but is currently located on secondary storage. This is the new, crucial case for demand paging.
Important Note: Marking a page as "invalid" in this context has no performance cost unless the program actually tries to access it. It's just a note in the page table.
Visualizing the Page Table (Go to Figure 10.4)
Look at Figure 10.4. It shows a process's page table and the corresponding physical memory.
- Pages 0, 2, and 5 are marked valid (v). Their frame numbers (4, 6, and 9) point to where they are loaded in physical memory (you can see pages A, C, and F there).
- Pages 1, 3, 4, 6, and 7 are marked invalid (i). They are legal pages but currently reside on the backing store (you can see pages B, D, E, G, and H waiting on disk).
The Critical Event: Page Fault¶
What happens when the CPU tries to access (read or write) an address in a page marked "invalid"?
- The paging hardware (the MMU) consults the page table during address translation.
- It sees the invalid bit is set.
- It cannot complete the translation, so it triggers a trap (an interrupt) to the operating system. This specific trap is called a page fault.
Handling a Page Fault: Step-by-Step (Go to Figure 10.5)
The OS follows a standard procedure to resolve the fault, make the page available, and resume the program transparently. Follow along with the numbered steps in Figure 10.5.
1. Validate the Access: The OS checks an internal table (part of the Process Control Block, or PCB) for this process. It must determine: was this memory reference to a valid page (just not in memory) or to an invalid one (an illegal address)?
   - If Invalid: The process has a bug (e.g., accessing null/unallocated memory). The OS typically terminates the process.
   - If Valid (but not in memory): Proceed to step 2. This is the "demand paging" case.
2. Locate the Page: The OS now knows it needs to fetch the page from secondary storage.
3. Find a Free Frame: The OS takes a free frame from the system-wide list of available physical memory frames.
4. Schedule Disk I/O: The OS schedules a disk read to bring the desired page from the backing store into the newly allocated frame. This is a potentially slow, blocking operation, so the process is put in a waiting state.
5. Complete the I/O & Update Tables: When the disk read completes:
   - The page's data is now in the physical frame.
   - The OS modifies the process's page table entry for this page: it sets the valid-invalid bit to valid and writes the physical frame number into the entry.
   - The OS also updates its internal bookkeeping data structures.
6. Restart the Instruction: The OS rolls back and restarts the exact machine instruction that caused the page fault, restoring the CPU state to just before the fault occurred. This time, when the instruction re-executes:
   - The page table entry is now valid.
   - The address translation succeeds.
   - The process accesses the memory location as if the page had been there all along, completely unaware of the interruption.
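The fault-handling sequence above can be condensed into a toy simulation. All the structures here are illustrative stand-ins; the real path involves hardware traps and disk I/O:

```python
# Toy demand-paging simulation: a memory access either hits, faults and
# loads the page from the backing store, or traps on an illegal reference.

def access(vpn, page_table, memory, backing_store, free_frames):
    entry = page_table.get(vpn)
    if entry is None:
        raise MemoryError("trap: illegal reference")  # invalid address: abort
    if entry["valid"]:
        return memory[entry["frame"]]                 # normal (fast) access
    frame = free_frames.pop()                         # find a free frame
    memory[frame] = backing_store[vpn]                # "disk read" (slow)
    entry["frame"], entry["valid"] = frame, True      # update the page table
    return memory[frame]                              # restarted access succeeds
```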
The Illusion¶
This entire complex procedure is what creates the powerful illusion of virtual memory. To the process, it appears to have a vast, contiguous address space all readily available in fast memory. The OS and hardware work together behind the scenes to maintain this illusion by swapping pages between fast RAM and slow disk on an as-needed basis.
Pure Demand Paging: The Extreme Starting Point¶
The concept of demand paging can be taken to its logical extreme:
- Initial State: A process begins execution with zero pages loaded into physical memory. Its entire address space exists only on the secondary storage (swap space).
- First Instruction Fault: When the OS starts the process and sets the CPU's instruction pointer (IP) to the first instruction's address, that address resides in a page not in memory. This causes an immediate page fault.
- Fault-Driven Loading: The OS handles the fault, loads the required page (containing the startup code), and restarts the instruction. The process then executes, triggering a page fault each and every time it tries to access a new page (for code or data) for the first time.
- Steady State: Eventually, the process will have all the pages it actively needs loaded into memory. At this point, page faults stop (or become very rare), and the process runs efficiently. This ideal model is called pure demand paging: a page is brought into memory only when a reference to it actually occurs.
Performance Concern: Could This Be Too Slow?¶
A major theoretical concern arises:
- What if a single instruction itself causes multiple page faults? For example, an instruction might be long and cross a page boundary (fault for instruction page), and then it might read data from one page (second fault) and write results to another page (third fault).
- If every instruction required multiple slow disk operations, system performance would be unacceptably poor.
The Savior: Locality of Reference¶
Fortunately, real program execution exhibits a property called locality of reference (detailed in Section 10.6.1). This means:
- Programs tend to access memory in clustered patterns, not randomly.
- After accessing one memory location, the next access is very likely to be nearby (in the same page or adjacent pages).
- Consequence for Paging: Once a page is faulted into memory, many subsequent instructions and data accesses will be satisfied from that same page or recently used pages, leading to a high hit rate. The number of page faults per instruction is very low in practice, making demand paging viable.
Hardware Requirements for Demand Paging¶
The hardware needed is not new; it's the same as for standard paging and swapping, but used in a specific way:
Page Table with Fault Detection:
- Must have a mechanism to mark an entry as not resident in main memory.
- This is typically the valid-invalid bit, as discussed.
- Alternatively, special protection bits (like a "present" bit) can serve the same purpose.
Secondary Memory (Swap Space):
- This is the backing storage that holds all pages not currently in RAM.
- Device: Usually a high-speed disk (HDD) or Non-Volatile Memory (NVM) device like an SSD.
- Terminology:
- Swap Device: The entire disk/NVM device used for this purpose.
- Swap Space: The specific area (partition or file) on that device reserved for holding swapped-out pages.
- Management: How the operating system allocates and manages this swap space is a critical topic covered in Chapter 11.
Key Takeaway¶
Demand paging relies on the interplay of:
- Hardware (MMU, Page Table): To detect when a needed page is absent (page fault).
- Operating System (Fault Handler): To perform the slow work of fetching the page from disk and updating tables.
- Program Behavior (Locality): To ensure that this fault-handling overhead is infrequent enough to maintain good performance, creating the successful illusion of a vast, fast virtual memory.
A Crucial Requirement: Restarting Any Instruction¶
The Core Principle¶
For demand paging to work transparently, the system must have a fundamental capability:
- After handling a page fault, the operating system must be able to restart the interrupted instruction from the very beginning, in the exact same state, as if the fault never happened. The only difference should be that the previously missing page is now in memory.
How is this possible? When the page fault trap occurs, the hardware automatically saves the execution state of the process (all registers, the condition codes, and crucially, the instruction counter/program counter pointing to the faulting instruction). This saved state allows the OS to restore and restart later.
Restarting in Typical Cases¶
For most simple instructions, restarting is straightforward:
- Fault on Instruction Fetch: The CPU hasn't even started the instruction. Restarting just means fetching it again from the same address (which is now accessible).
- Fault on Operand Fetch: The CPU fetched and decoded the instruction but faulted while trying to read the data it needs (the operand). To restart:
- Fetch the instruction again.
- Decode it again.
- Fetch the operand again (which will now succeed).
A Detailed Walkthrough: The Three-Address ADD Instruction¶
Consider the instruction ADD A, B, C (add the contents of A to B, storing the result in C). Its execution involves multiple, distinct steps:
1. Fetch and decode the ADD instruction.
2. Fetch operand A.
3. Fetch operand B.
4. Perform the addition (A + B).
5. Store the result to location C.
Scenario: What if a page fault occurs during step 5 because the page containing memory location C is not in RAM?
- The addition (step 4) has already been computed.
- The OS handles the fault: brings in the page for C, updates the page table.
- To restart, the OS reloads the saved state; the instruction pointer points back to the ADD instruction.
- The process must now execute the entire instruction again:
  - Re-fetch and re-decode ADD.
  - Re-fetch A and B.
  - Re-compute the sum.
  - Store to C (which now succeeds).
- Analysis: There is some repeated work (steps 1-4 are re-done), but it's less than one full instruction's worth of extra work. This overhead is acceptable because page faults are meant to be infrequent.
The Major Difficulty: Instructions That Modify Multiple Locations¶
The real architectural challenge comes from complex instructions that can modify many memory locations before completing. A classic example is the IBM System 360/370 MVC (Move Character) instruction, which copies a block of up to 256 bytes.
The Problem:
- The instruction might straddle page boundaries. The source block or destination block might span two (or more) pages.
- A page fault could occur mid-operation—after some bytes have already been copied, but before the instruction finishes.
- If the source and destination blocks overlap, the partially completed move may have already overwritten some of the original source data. You cannot simply restart from the beginning because the source data is now corrupted.
Architectural Solutions¶
CPU architects have designed solutions to make such instructions "restartable." Two common strategies are:
Pre-check and Validate Access Before Any Modification:
- The CPU's microcode (the low-level instruction logic) is designed to first calculate and attempt to access the start and end addresses of both the source and destination blocks.
- If any of these required pages are not in memory, a page fault occurs immediately, before a single byte is moved.
- After the OS resolves the fault and restarts the instruction, all pages are guaranteed to be present, and the move proceeds to completion without interruption.
Use of Temporary Registers (Write-Back on Fault):
- As the instruction executes, if it needs to overwrite a memory location, it first copies the old value into a hidden temporary register inside the CPU.
- It then writes the new value.
- If a page fault occurs later in the instruction, the CPU's microcode rolls back by writing the saved old values from the temporary registers back to their memory locations.
- This restores memory to its exact state before the instruction began, allowing a clean restart.
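The undo-log idea behind this second strategy can be sketched as follows. The block move and the simulated fault are toy constructs, not the actual MVC microcode:

```python
# Sketch of "save old values, roll back on fault": each overwritten byte is
# recorded first, so a mid-instruction fault restores the original memory.

def move_block(mem, src, dst, n, fault_at=None):
    """Copy n cells from src to dst; roll back fully if a fault occurs."""
    undo = []                                     # saved (address, old value)
    try:
        for i in range(n):
            if fault_at == i:
                raise RuntimeError("page fault")  # simulated mid-move fault
            undo.append((dst + i, mem[dst + i]))  # save before overwriting
            mem[dst + i] = mem[src + i]
    except RuntimeError:
        for addr, old in reversed(undo):          # restore pre-instruction state
            mem[addr] = old
        raise                                     # re-raise so the OS can handle it
```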
Key Architectural Insight¶
This discussion highlights a critical point:
- Paging is inserted as a layer between the CPU and physical memory.
- The goal is complete transparency to the software process.
- While it's often assumed paging can be added to any system, this is only true for systems without demand paging (where a missing page is just a fatal error).
- Demand paging requires specific, careful CPU architectural support to guarantee that any instruction can be interrupted by a page fault and later restarted without causing side effects or data corruption. The
MVCexample illustrates the non-trivial engineering required to make this work.
10.2.2 Free-Frame List¶
Purpose: The OS's Pool of Memory Frames¶
A free-frame list is a fundamental operating system data structure. It is simply a list (or pool) of physical memory frames that are currently unused and available for allocation.
- Why is it needed? When a page fault occurs, the OS needs an empty physical frame to load the missing page into. Similarly, when a process's stack or heap grows, it needs new frames for the expanded segment.
- Function: The free-frame list is the source for these frames. It's the OS's "inventory" of ready-to-use physical memory.
Visual Representation (Go to Figure 10.6)¶
Figure 10.6 shows a simple linked-list representation of the free-frame list. Each node in the list contains the frame number of a free physical frame (e.g., frame 7, frame 97, etc.). The OS maintains a head pointer to track the next available frame.
Frame Allocation Policy: Zero-Fill-on-Demand¶
When the OS takes a frame from the free list and assigns it to a process, it cannot give the process whatever old data happens to be in that frame. This is a critical security and correctness requirement.
- The Policy: Zero-fill-on-demand. Before a frame is handed over to a process, the OS ensures the entire frame is filled with zeros.
- Why is this essential?
- Security: A frame previously used by one process might contain sensitive data (passwords, keys, documents). If reassigned without clearing, the new process could read this old data, causing a severe information leak.
- Correctness: The new process expects its memory to be initialized (often to zero). Random old data would cause unpredictable bugs.
- The "on-demand" part means the zeroing is done at the time of allocation, not necessarily beforehand. This avoids unnecessary work if the frame isn't used.
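Zero-fill-on-demand allocation can be sketched in a few lines; the free list and physical-memory dict below are toy structures:

```python
# Sketch of zero-fill-on-demand: the frame is cleared at allocation time,
# so no stale data from a previous owner can leak to the new process.

PAGE_SIZE = 4096

def allocate_frame(free_list, physical_memory):
    """Take the head of the free-frame list and hand it over zeroed."""
    frame = free_list.pop(0)                       # remove from the free list
    physical_memory[frame] = bytearray(PAGE_SIZE)  # all zeros; old data erased
    return frame
```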
Lifecycle of the Free-Frame List¶
- System Startup: When the OS boots, it identifies all physical memory not required for the kernel itself. All these available frames are placed onto the free-frame list. The list is at its maximum size.
- Normal Operation (Depletion): As processes run and cause page faults or expand their segments, frames are removed from the free list and assigned. The free-frame list shrinks.
- Critical Threshold: Eventually, the free-frame list will become empty or fall below a low-water mark (a minimum threshold set by the OS).
- Problem: The next page fault cannot be satisfied because there is no free frame to load the new page into.
- Replenishment: When the list is (near) empty, the OS must repopulate it. It does this by freeing up frames that are currently in use by selecting victim pages and writing them out to disk if modified. This process is known as page replacement.
Next Step: The strategies and algorithms for deciding which frame to free (the page-replacement policy) and handling the overall memory pressure are the central topics of Section 10.4.
10.2.3 Performance of Demand Paging¶
Demand paging is not free. It introduces a massive performance penalty when a page fault occurs, and the overall system performance depends almost entirely on how often these faults happen.
1. The Effective Access Time Formula¶
We measure the impact using Effective Access Time (EAT)—the average time it takes for the CPU to access a memory location, factoring in both normal accesses and page faults.
- ma = Memory Access Time (when the page is in RAM). Example: 10–200 nanoseconds.
- p = Probability of a Page Fault (0 ≤ p ≤ 1). We want this to be very close to 0.
- Page Fault Time = the total time to service a single page fault (can be milliseconds).
The formula is a weighted average: Effective Access Time = (1 – p) × ma + p × (Page Fault Time)
This shows that the overall speed is dominated by the page fault rate (p) because Page Fault Time is enormous compared to ma.
2. Anatomy of a Page Fault (The Source of the Delay)¶
A page fault is not a simple disk read. It's a complex sequence of software and hardware operations:
Sequence of Events During a Page Fault:
- Trap to OS: CPU interrupt.
- Save Process State: Save registers, program counter, etc.
- Diagnose Fault: OS determines this interrupt was a page fault.
- Validate & Locate: Check if the address is legal. Find where the needed page is on disk.
- Issue I/O Request: Request to read the page into a free frame.
- a. Queue Wait: Wait in line for the disk.
- b. Seek & Latency: Disk head moves and waits for the right sector (the slowest part).
- c. Data Transfer: Page is read into memory.
- CPU Reallocation (Optional but Crucial): While waiting for the very slow I/O (steps 5a-5c), the OS suspends the faulting process and gives the CPU to another ready process. This maintains CPU utilization (multiprogramming).
- I/O Completion Interrupt: Disk signals the read is done.
- Save Other Process State: If step 6 happened, save the state of the process currently on the CPU.
- Diagnose I/O Interrupt: OS determines this interrupt is from the disk for the page-in request.
- Update Tables: Correct the page table and OS data structures to show the page is now in memory.
- Wait for CPU: The original process waits in the ready queue to get the CPU back.
- Restore & Resume: Restore the process's saved state (from step 2) and restart the faulting instruction.
Three Major Components of Page-Fault Service Time:
- Service the page-fault interrupt (Steps 1-4, 10): Software overhead in the OS kernel. Can be 1–100 microseconds with good coding.
- Read in the page (Step 5): The dominant cost. Device service time (seek, latency, transfer). For a traditional HDD, this is ~8 milliseconds (8,000,000 ns).
- Restart the process (Steps 11-12): Context-switch and resume overhead. Also 1–100 microseconds.
Important: The 8 ms figure is for the disk service time only. If the disk is busy, queueing delay adds even more time.
3. The Staggering Performance Impact (Numerical Example)¶
Let's plug in realistic numbers to see the devastating effect of even a low page-fault rate.
- Assume: ma = 200 ns, Page Fault Time = 8 ms = 8,000,000 ns.
- Formula becomes: EAT = (1 – p) × 200 + p × 8,000,000 = 200 + 7,999,800 × p
Scenario A: 1 fault per 1,000 accesses (p = 0.001)
- EAT = 200 + 7,999,800 × 0.001 = 200 + 7,999.8 ≈ 8,200 ns (8.2 microseconds).
- Slowdown Factor: 8,200 ns / 200 ns = 41 times slower.
Scenario B: Acceptable Slowdown (10% or less)
- We want EAT ≤ 220 ns (a 10% increase from 200 ns).
- Solve: 220 ≥ 200 + 7,999,800 × p
- 20 ≥ 7,999,800 × p
- p ≤ 0.0000025
- Interpretation: To keep performance degradation under 10%, fewer than one memory access in every 399,990 (about 1 in 400,000) can be allowed to cause a page fault.
CONCLUSION: The page-fault rate (p) must be kept EXTREMELY low for demand paging to be viable. This is why page-replacement algorithms (Section 10.4) and the principle of locality are so critical.
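The arithmetic in both scenarios can be double-checked with a few lines of Python:

```python
def eat_ns(p, ma_ns=200, fault_ns=8_000_000):
    """Effective access time: weighted average of normal accesses and faults."""
    return (1 - p) * ma_ns + p * fault_ns

# Scenario A: one fault per 1,000 accesses
print(eat_ns(0.001))          # 8199.8 ns, about 41x slower than 200 ns

# Scenario B: largest p that keeps EAT <= 220 ns (10% degradation)
p_max = (220 - 200) / (8_000_000 - 200)
print(p_max)                  # ~2.5e-06, i.e. fewer than 1 fault per ~400,000 accesses
```

Because the fault time is roughly 40,000 times larger than the memory-access time, even tiny values of p dominate the average.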
4. Swap Space vs. File System: Where to Page From?¶
The backing store for pages can be the swap space or the original executable file.
- Swap Space I/O is Faster than regular file system I/O because:
- It uses larger blocks.
- No file metadata lookups (directory, inode) or complex indirect allocation is needed. (See Chapter 11).
Strategies for Using Backing Store:
Copy Entire File to Swap at Startup:
- Process: At program launch, copy all pages from the executable file into the swap area.
- Pro: All subsequent paging is fast (from swap).
- Con: Slow startup, wastes time/space copying pages that may never be used.
Demand Page from File, Write-Back to Swap (Linux/Windows Hybrid):
- Process: Initially, page faults are serviced by reading pages directly from the executable file.
- Key Twist: When a dirty page (one that has been modified) is chosen for replacement, it is written to swap space, not back to the original read-only file.
- Result: All subsequent requests for that modified page come from the faster swap space. This is a practical and common compromise.
Demand Page from File Only (for Code/Read-Only Sections):
- Used for binary executable code pages, which are never modified.
- Process: These pages are paged in directly from the file system.
- On replacement, they can be simply discarded (overwritten) because a clean copy exists on disk. If needed again, re-read from the file.
- Swap space is still required for anonymous memory—pages not associated with a file, like a process's stack and heap (which are private and modifiable). Linux and BSD use this approach.
5. Special Case: Mobile Systems¶
As noted in Section 9.5.3, most mobile OS (iOS, Android) do not support traditional swapping to a dedicated swap area.
- Why? Limited flash memory write cycles and a focus on app responsiveness and battery life.
- Their Strategy:
- Demand-page read-only pages (code) directly from the app's executable file.
- Under memory pressure, reclaim (discard) clean, read-only pages. They can be reloaded from the file later.
- For anonymous memory (stack/heap), iOS never reclaims it unless the app quits or voluntarily frees it. Android and others may use compressed RAM (Section 10.7) as a faster alternative to swapping to flash storage.
OVERARCHING GOAL: Regardless of the strategy, the system must minimize the page-fault rate (p) to maintain performance. The choice of backing store and paging strategy is an engineering trade-off between speed, storage wear, and implementation complexity.
10.3 Copy-on-Write¶
Motivation: Optimizing Process Creation with fork()¶
A major performance challenge in operating systems is the fork() system call, which creates a new (child) process as a duplicate of the parent.
- The Naive Approach: The traditional implementation of fork() physically copies the parent's entire address space (all pages) into new frames for the child. This is expensive in time and memory.
- The Common Reality: In many cases (like in a shell), the child process immediately calls exec() after fork() to replace its memory image with a new program. The costly copy of the parent's space is immediately discarded, making it completely unnecessary waste.
Solution: Use Copy-on-Write (COW), a clever optimization that defers copying until absolutely necessary.
How Copy-on-Write Works¶
The core idea is to share pages initially and only make private copies if/when either process tries to modify them.
Initial Setup (After fork() with COW):
- Instead of copying pages, the OS makes the child's page table point to the same physical frames as the parent.
- All these now-shared pages are marked as read-only in the page tables of both parent and child.
- The OS also marks them internally as copy-on-write in its bookkeeping data.
Trigger Event (The "Write"): When either the parent or the child attempts to write to a shared COW page:
- The hardware detects a write attempt to a read-only page and generates a protection fault (similar to a page fault).
- The OS trap handler identifies this as a copy-on-write fault (not a normal page fault).
- The OS then:
- Allocates a new free frame from the free-frame list.
- Copies the contents of the original shared page into this new frame.
- Maps the faulting process's page table entry to this new copy, and changes the permission for that entry to read-write.
- Leaves the other process's mapping (to the original page) unchanged and still read-only.
- The OS then restarts the faulting instruction, which now executes successfully on the private copy.
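A minimal sketch of this mechanism, using toy page tables that map page numbers to `(frame, writable)` pairs (all names and data shapes here are illustrative, not any real kernel's structures):

```python
class COWMemory:
    """Toy physical memory: frame number -> page contents."""
    def __init__(self):
        self.frames = {}
        self.next_frame = 0

    def new_frame(self, data):
        f = self.next_frame
        self.next_frame += 1
        self.frames[f] = data
        return f

def cow_fork(parent_table):
    """Child shares the parent's frames; both sides become read-only (COW)."""
    child_table = {}
    for page, (frame, _) in parent_table.items():
        parent_table[page] = (frame, False)   # mark read-only in parent
        child_table[page] = (frame, False)    # same frame, also read-only
    return child_table

def write(mem, table, page, data):
    frame, writable = table[page]
    if not writable:
        # Protection fault -> COW fault: copy into a private frame,
        # remap this process read-write; the other mapping is untouched.
        frame = mem.new_frame(mem.frames[frame][:])
        table[page] = (frame, True)
    mem.frames[frame] = data

mem = COWMemory()
parent = {0: (mem.new_frame(b"A"), True), 1: (mem.new_frame(b"B"), True)}
child = cow_fork(parent)
assert parent[0][0] == child[0][0]        # pages shared after fork
write(mem, child, 0, b"A'")               # child writes -> private copy
assert parent[0][0] != child[0][0]        # now separate frames
assert mem.frames[parent[0][0]] == b"A"   # parent's view unchanged
```

The unmodified page 1 stays shared forever, which is where the memory savings come from.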
Visual Guide (Go to Figures 10.7 and 10.8):
- Figure 10.7 (Before Modification): Shows two processes sharing physical pages A, B, and C. Page C is marked copy-on-write (implied by the sharing).
- Figure 10.8 (After Process 1 Modifies Page C): Process 1 now has its own private copy of page C. Process 2 still points to the original page C. The shared pages A and B remain shared, as they were not modified.
Key Benefits and Details¶
- Efficiency: Only modified pages are copied. All unmodified pages (like large sections of program code) remain shared, saving significant memory and copy time.
- What Needs to be COW? Only modifiable pages (data, stack, heap) need the COW protection. Read-only pages (like executable code) can be safely shared forever and never need copying.
- Widespread Use: This optimization is standard in modern OSes like Windows, Linux, and macOS.
Special Case: The vfork() System Call¶
UNIX systems offer an even more extreme optimization called vfork().
- How it works: vfork() creates the child process without any page duplication or COW setup. The child process borrows the parent's entire address space temporarily, and the parent is suspended.
- Critical Danger: The child operates directly on the parent's memory. Any modification the child makes is immediately visible to the parent when it resumes. This can easily cause corruption if not handled correctly.
- Strict Usage Rule: vfork() is only safe if the child immediately calls exec() or _exit(). exec() replaces the address space, and _exit() terminates, so no long-term sharing occurs.
- Purpose & Efficiency: It is designed for the classic "fork-and-exec" pattern where a shell launches a command. It is extremely fast because it avoids all page-table and frame-management overhead at creation time. It is sometimes used to implement shell interfaces.
Summary: Copy-on-Write is a lazy-evaluation technique for memory management that optimizes fork() by sharing pages and copying only on a write attempt. vfork() is a more dangerous, non-lazy alternative used for maximum speed in specific, controlled scenarios.
10.4 Page Replacement¶
The Context: Memory Over-Allocation¶
Our previous analysis assumed a page is loaded once and stays in memory. But to use memory efficiently, operating systems intentionally over-allocate memory—they allow the total size of all active processes' working sets to exceed the available physical RAM.
- Why Over-Allocate? Because of demand paging and locality. A 10-page process might only actively use 5 pages at a time.
- Benefit: We can run more processes (higher degree of multiprogramming), keeping the CPU busy and increasing throughput.
- Example: With 40 frames, you could run:
- 4 processes if each needed all 10 pages (4 * 10 = 40 frames).
- 8 processes if each only uses 5 pages on average (8 * 5 = 40 frames used, but 8 * 10 = 80 pages allocated). The spare frames are a buffer.
The Inevitable Problem: Running Out of Free Frames¶
Over-allocation works until the combined demand spikes.
- Scenario: All 8 processes might simultaneously need their full 10 pages, demanding 80 frames when only 40 exist.
- Additional Strain: Physical memory isn't just for program pages. A significant portion is used for I/O buffers (caching disk data). The OS must balance memory between processes and I/O, a complex task discussed in Section 14.6.
The Crisis Point: When a page fault occurs and the OS goes to the free-frame list to bring in the needed page, it discovers the list is empty. All frames are in use.
Possible OS Responses to No Free Frames¶
Terminate the Faulting Process:
- Why it's bad: Demand paging's goal is transparency and efficiency. Crashing a process because the system is over-committed breaks the illusion of abundant memory and is unfair to the user. This is not an acceptable solution.
Swap Out an Entire Process (Traditional Swapping):
- Action: Move one entire process's pages to disk, freeing all its frames.
- Drawback: This reduces the degree of multiprogramming. More critically, as discussed in Section 9.5, copying entire processes is too slow and inefficient for modern systems with large address spaces. This method is largely obsolete.
The Standard Solution: Page Replacement¶
The modern, universal approach is to combine swapping and paging at the page granularity.
- Core Idea: When no free frame exists, the OS selects a victim frame that is currently in use by some process. It frees this victim frame to serve the new page fault. This is a two-step process:
- Find a "victim" page currently in memory. Decide which one to evict using a page-replacement algorithm.
- Free the victim's frame:
- If the victim page is clean (not modified since being loaded), it can simply be overwritten (its copy on disk is still valid).
- If the victim page is dirty (modified), it must first be written back to disk (swapped out) to save its contents, then its frame can be reused.
This is the fundamental technique. The remainder of this section will detail the algorithms and intricacies of choosing the victim page effectively.
10.4.1 Basic Page Replacement¶
The Core Algorithm¶
When a page fault occurs and no free frame is available, the OS must execute the page-replacement algorithm. The goal is to free a frame that is currently in use so it can be given to the faulting page.
The Page-Replacement Process (Follow along with Figure 10.10):
- Locate Desired Page: OS determines the disk address of the needed page.
- Find a Free Frame:
- a. If the free-frame list has a frame, use it. Go to step 3.
- b. If no free frame exists (the "?" in Figure 10.9), use a page-replacement algorithm to select a victim frame currently held by some process.
- c. Free the Victim Frame:
- Check if the victim page is dirty (modified). If yes, schedule a write (page-out) of the victim page's contents to the swap space/backing store.
- Update the victim page's page table entry, marking it as invalid and not resident.
- The frame is now considered free.
- Page-In: Read the desired page from disk into the newly freed frame. Update the faulting process's page table to mark the page valid and point to this frame.
- Restart: Resume execution of the faulting process from the interrupted instruction.
Crucial Note: If a victim page must be written out, the page fault now requires two disk I/O operations: one write (page-out) and one read (page-in). This doubles the page-fault service time, severely impacting performance.
Performance Optimization: The Modify (Dirty) Bit¶
To minimize this double-I/O penalty, hardware provides a modify bit (or dirty bit) in each page table entry (or TLB entry).
- How it works: The CPU's memory management hardware automatically sets this bit to 1 whenever the process performs a write operation to any location within that page.
- Use during Page Replacement: When selecting a victim, the OS checks its dirty bit.
- If Dirty Bit = 1: The page has been modified. Its copy on disk is stale. The OS must write the page back to disk (page-out) before freeing its frame.
- If Dirty Bit = 0: The page is clean. The copy in memory is identical to the copy on disk. The OS can simply discard the page (overwrite the frame) without any disk write. The required disk I/O is just the page-in.
Impact: This can halve the I/O time for a page fault if the victim page is clean. This is especially beneficial for read-only pages (like program code), which are always clean and can be discarded instantly.
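The replacement procedure and the dirty-bit optimization can be condensed into a small sketch (the `service_fault` helper and its data shapes are illustrative assumptions, not an actual kernel interface):

```python
def service_fault(page, resident, free_frames, victim_of):
    """Service one page fault. `resident` maps page -> dirty flag.
    Returns (disk I/Os needed, remaining free frames)."""
    ios = 1                                  # the page-in read, always required
    if free_frames > 0:
        free_frames -= 1                     # easy case: a free frame exists
    else:
        victim = victim_of(resident)         # the page-replacement algorithm's choice
        if resident[victim]:                 # dirty bit set: disk copy is stale
            ios += 1                         # page-out write before frame reuse
        del resident[victim]                 # victim's PTE becomes invalid
    resident[page] = False                   # newly loaded page starts clean
    return ios, free_frames

resident = {"A": False, "B": True}           # B has been modified since loading
ios, _ = service_fault("C", resident, 0, victim_of=lambda r: "B")
print(ios)  # 2: dirty victim required a page-out plus the page-in
ios, _ = service_fault("D", resident, 0, victim_of=lambda r: "A")
print(ios)  # 1: clean victim was simply discarded
```

The `victim_of` parameter is a stand-in for the page-replacement algorithms discussed in the following subsections.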
The Illusion Completed: Logical vs. Physical Memory Separation¶
Page replacement is the final piece that completes the powerful illusion of virtual memory.
- Without Demand Paging: The logical address space can be larger than physical memory, but all pages of a running process must still be in RAM.
- With Demand Paging + Page Replacement: This constraint is removed. The size of a process's logical address space is no longer limited by physical memory.
- Example: You can run a 20-page process on a system with only 10 physical frames.
- How? Only 10 pages reside in memory at any instant. When the process needs an 11th page, the page-replacement algorithm selects one of the current 10 as a victim, swaps it if needed, and loads the new page. The process can access a virtual address space much larger than physical RAM.
The Two Fundamental Design Problems¶
To make this work efficiently, OS designers must solve two interrelated problems:
- Frame-Allocation Algorithm: When there are multiple processes in memory, how do we distribute the available physical frames among them?
- Do we give each process an equal number? Or proportional to its size? Or based on its priority?
- Page-Replacement Algorithm: When a page fault occurs and we need a free frame, which specific page should we select as the victim?
- The goal is to choose the victim such that future page faults are minimized.
Why These Algorithms Are Critical: Disk I/O is extremely slow (milliseconds) compared to memory access (nanoseconds). Therefore, even a small improvement in the hit rate of these algorithms (reducing the number of page faults) leads to a massive gain in overall system performance. The following sections will explore specific algorithms for page replacement.
Evaluating Page-Replacement Algorithms¶
Many different page-replacement algorithms exist, each with its own strategy for selecting a victim. The primary metric for comparing them is: Goal: Choose the algorithm that yields the lowest page-fault rate.
We evaluate algorithms by simulating their behavior on a recorded sequence of memory accesses.
The Reference String¶
To test an algorithm, we use a reference string—a sequence of memory accesses generated by a process.
- How to get it: You can either:
- Generate one artificially (e.g., using a random-number generator).
- Trace a real system and record every memory address accessed (this produces massive data—millions per second).
- Data Reduction - Two Key Simplifications:
- Focus on Page Numbers: For a fixed page size (e.g., 4KB), we only care about the page number, not the full address. The offset within the page is irrelevant for replacement decisions.
- Remove Consecutive Duplicates: If a page is referenced multiple times in a row, only the first reference can cause a page fault. Subsequent back-to-back references will always find the page already in memory. We compress these consecutive duplicates into a single entry in the reference string.
Example of Generating a Reference String:
Raw address sequence (in decimal, assuming 100-byte pages):
0100, 0432, 0101, 0612, 0102, 0103, 0104, 0101, 0611, 0102, 0103, 0104, 0101, 0610, 0102, 0103, 0104, 0101, 0609, 0102, 0105
- Convert to Page Numbers: Divide each address by page size (100).
- 0100 / 100 = Page 1
- 0432 / 100 = Page 4
- 0101 / 100 = Page 1
- ... and so on.
- Intermediate result: 1, 4, 1, 6, 1, 1, 1, 1, 6, 1, 1, 1, 1, 6, 1, 1, 1, 1, 6, 1, 1
- Remove Consecutive Duplicates: Collapse runs of the same page number.
- Final reference string: 1, 4, 1, 6, 1, 6, 1, 6, 1, 6, 1
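This conversion is mechanical; a short Python version of the two simplification steps:

```python
from itertools import groupby

# Raw address trace from the example (decimal, 100-byte pages)
addresses = [100, 432, 101, 612, 102, 103, 104, 101, 611, 102,
             103, 104, 101, 610, 102, 103, 104, 101, 609, 102, 105]

PAGE_SIZE = 100

# Step 1: keep only the page number (integer division drops the offset)
pages = [a // PAGE_SIZE for a in addresses]
print(pages)
# [1, 4, 1, 6, 1, 1, 1, 1, 6, 1, 1, 1, 1, 6, 1, 1, 1, 1, 6, 1, 1]

# Step 2: collapse consecutive duplicates (back-to-back hits never fault)
reference_string = [p for p, _ in groupby(pages)]
print(reference_string)
# [1, 4, 1, 6, 1, 6, 1, 6, 1, 6, 1]
```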
The Impact of Available Frames¶
The performance of any algorithm depends critically on the number of page frames (m) available in physical memory.
- More Frames = Fewer Faults: As you increase m, the page-fault rate decreases. There's more room to keep pages, so fewer replacements are needed.
- Example with our string (1, 4, 1, 6, 1, 6, 1, 6, 1, 6, 1):
  - With 1 frame: Every new page reference forces a replacement. Faults = 11.
  - With 3 frames: Can hold pages 1, 4, and 6 simultaneously after the first few faults. Faults = 3 (first access to pages 1, 4, and 6 only).
- The Curve (Go to Figure 10.11): The relationship is not linear. The graph shows number of page faults (y-axis) vs. number of frames (x-axis). The curve drops sharply at first, then levels off to a minimum. Adding more physical memory (more frames) moves you right on this curve, reducing faults.
Standard Test Setup¶
To compare algorithms side-by-side, we will use a common reference string and a fixed, small number of frames. This creates a constrained scenario where replacement decisions matter.
- Reference String for Examples: 7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1
- Number of Frames: 3
- We will analyze how different algorithms process this string and count their resulting page faults.
10.4.2 FIFO Page Replacement¶
The Algorithm: First-In, First-Out¶
The FIFO (First-In, First-Out) page-replacement algorithm is the simplest to understand and implement. It treats the set of pages in memory like a queue.
- Rule: When a page must be replaced, select the oldest page—the one that has been in memory the longest.
- Implementation: Maintain a FIFO queue of all pages currently in memory.
- When a page is loaded into a frame, insert it at the tail (end) of the queue.
- When replacement is needed, remove the page at the head (front) of the queue for eviction.
Example Execution (Follow Figure 10.12)¶
Let's trace the algorithm with our standard test: Reference string = 7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1 and 3 frames.
The table (Figure 10.12) shows the contents of the 3 frames after each reference. A page fault (F) triggers a load or a replacement.
| Step | Ref | Frame Contents (Queue: Head -> Tail) | Fault? | Notes |
|---|---|---|---|---|
| 1 | 7 | [7] | F | Load 7. Queue: 7 |
| 2 | 0 | [7, 0] | F | Load 0. Queue: 7, 0 |
| 3 | 1 | [7, 0, 1] | F | Load 1. Queue: 7, 0, 1 |
| 4 | 2 | [2, 0, 1] | F | Replace oldest (7) with 2. Queue: 0, 1, 2 |
| 5 | 0 | [2, 0, 1] | - | 0 is already in memory. |
| 6 | 3 | [2, 3, 1] | F | Replace oldest (0) with 3. Queue: 1, 2, 3 |
| 7 | 0 | [2, 3, 0] | F | Replace oldest (1) with 0. Queue: 2, 3, 0 |
| 8 | 4 | [4, 3, 0] | F | Replace oldest (2) with 4. Queue: 3, 0, 4 |
| ... | ... | ... | ... | ... (Continues as per figure) |
Total Page Faults with FIFO and 3 frames = 15.
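The fault count is easy to verify with a minimal FIFO simulator (a sketch, not production code):

```python
from collections import deque

def fifo_faults(refs, num_frames):
    """Count page faults for FIFO replacement on a reference string."""
    queue, resident, faults = deque(), set(), 0
    for page in refs:
        if page in resident:
            continue                      # hit: FIFO order is NOT updated
        faults += 1
        if len(resident) == num_frames:
            victim = queue.popleft()      # evict the oldest arrival
            resident.remove(victim)
        queue.append(page)
        resident.add(page)
    return faults

refs = [7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1]
print(fifo_faults(refs, 3))  # 15
```

Note that a hit does not touch the queue: FIFO cares only about when a page arrived, never about when it was last used.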
Pros and Cons of FIFO¶
- Advantage: Extremely simple to implement (just a queue).
- Disadvantage: Performance is often poor because it ignores the usage pattern.
- It might evict an old but frequently used page (like a common variable), causing an immediate extra page fault to bring it back.
- It might keep a new but unused page just because it was loaded recently.
Important: Even a bad choice doesn't cause incorrect execution—it just hurts performance by increasing the fault rate.
The Shock: Belady's Anomaly¶
FIFO exhibits a counter-intuitive and problematic behavior known as Belady's Anomaly.
- What it is: For the FIFO algorithm, increasing the number of frames can sometimes lead to an increase in the number of page faults.
- Normal Expectation: More memory (more frames) should always result in fewer or equal page faults. (See the general curve in Figure 10.11).
- FIFO's Violation: The curve for FIFO (shown in Figure 10.13) is not always decreasing. It can have bumps where it goes up.
Example from the text:
Reference string: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
- With 3 frames: 9 page faults.
- With 4 frames: 10 page faults (More frames, more faults!).
Why this happens: The order of eviction in FIFO depends entirely on arrival time, not on future need. Adding a frame can change the sequence of pages in the queue in a way that accidentally ejects a page that will be needed again very soon, while a smaller set might have retained it.
Significance: Belady's Anomaly shows that FIFO is not a robust algorithm. A good page-replacement algorithm should guarantee that more frames never result in more faults (the stack property). FIFO fails this test.
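The anomaly is easy to reproduce with a quick, self-contained FIFO simulation:

```python
from collections import deque

def fifo_faults(refs, num_frames):
    """Count FIFO page faults (queue doubles as the resident set)."""
    queue, faults = deque(), 0
    for page in refs:
        if page in queue:
            continue                  # hit
        faults += 1
        if len(queue) == num_frames:
            queue.popleft()           # evict oldest arrival
        queue.append(page)
    return faults

belady = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(fifo_faults(belady, 3))  # 9
print(fifo_faults(belady, 4))  # 10 -- more frames, yet more faults
```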
10.4.3 Optimal Page Replacement¶
The Ideal Algorithm: Look into the Future¶
The discovery of Belady's Anomaly prompted the question: What is the best possible page-replacement algorithm? This ideal algorithm is called OPT (Optimal) or MIN (Minimum).
- Rule: When a page must be replaced, select the page whose next use will occur farthest in the future. In other words, replace the page you will need the longest time from now.
- Result: This algorithm guarantees the lowest possible page-fault rate for any given number of frames and any reference string. It also never suffers from Belady's Anomaly (more frames never cause more faults).
Example Execution (Follow Figure 10.14)¶
Let's trace OPT with our standard test: Reference string = 7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1 and 3 frames.
The table (Figure 10.14) shows the frame contents. The key is that OPT makes decisions based on future knowledge of the reference string.
Total Page Faults with OPT and 3 frames = 9.
Performance Comparison: OPT's 9 faults is vastly better than FIFO's 15 faults. If we ignore the first three compulsory faults that every algorithm must incur, OPT has 6 faults vs. FIFO's 12 faults—OPT is twice as good. This is the gold standard.
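Although OPT cannot run online, its decisions depend only on the reference string itself, so it can be simulated offline. A sketch:

```python
def opt_faults(refs, num_frames):
    """Optimal (OPT/MIN): evict the page whose next use is farthest away."""
    resident, faults = set(), 0
    for i, page in enumerate(refs):
        if page in resident:
            continue                       # hit
        faults += 1
        if len(resident) == num_frames:
            def next_use(p):
                # Index of p's next reference; pages never used again rank last.
                future = refs[i + 1:]
                return future.index(p) if p in future else float("inf")
            resident.remove(max(resident, key=next_use))
        resident.add(page)
    return faults

refs = [7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1]
print(opt_faults(refs, 3))  # 9
```

The `refs[i + 1:]` lookahead is exactly the "perfect future knowledge" that makes OPT unimplementable in a real OS.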
The Major Limitation: Unrealizable¶
OPT is not implementable in a real, general-purpose operating system.
- Why? It requires perfect future knowledge of the entire reference string—i.e., knowing exactly which pages a process will access and in what order, before it happens.
- Analogy: This is similar to the Shortest-Job-First (SJF) CPU scheduling algorithm (Section 5.3.2), which also requires knowing the future (burst times).
Practical Use: A Benchmark for Comparison¶
Because OPT is unimplementable, its real value is as a theoretical benchmark.
- Researchers and OS designers use it to evaluate real, practical algorithms.
- Example Metric: "Our new algorithm produces, at worst, 12.3% more page faults than OPT, and on average only 4.7% more."
- By measuring how close a real algorithm gets to OPT, we can understand its efficiency and quality.
Conclusion: OPT defines the limit of what is possible. All real page-replacement algorithms are attempts to approximate OPT's behavior using only past information and heuristics, without needing a crystal ball.
10.4.4 LRU Page Replacement¶
The Concept: Using the Recent Past to Predict the Near Future¶
Since the optimal (OPT) algorithm is impossible to implement (it needs future knowledge), we need a feasible approximation. The key insight:
- OPT looks forward: Replaces the page used farthest in the future.
- FIFO looks at load time: Replaces the page loaded longest ago.
- LRU (Least Recently Used) approximation: Uses the recent past as a predictor. It assumes that pages heavily used in the recent past will likely be used again in the near future. Therefore, it protects them.
Rule: When a page must be replaced, LRU selects the page that has not been used for the longest period of time. It replaces the least recently used page.
Interesting Theoretical Note: If you reverse a reference string (S becomes S_R), the page-fault rate of OPT on S equals OPT on S_R. Similarly, LRU on S equals LRU on S_R. This symmetry highlights LRU as "OPT looking backward in time."
Example Execution (Follow Figure 10.15)¶
Tracing LRU with our standard test: Reference string = 7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1 and 3 frames.
We must now track the order of last use for pages in memory.
| Step | Ref | Frame Contents (LRU Order: Most Recent -> Least Recent) | Fault? | Reasoning for Replacement |
|---|---|---|---|---|
| 1 | 7 | [7] | F | Load. |
| 2 | 0 | [0, 7] | F | Load 0. Now 0 is most recent, 7 is least recent. |
| 3 | 1 | [1, 0, 7] | F | Load 1. Order: 1 (newest), 0, 7 (oldest). |
| 4 | 2 | [2, 1, 0] | F | Need a frame. LRU page is 7 (last used at step 1). Evict 7, load 2. New order: 2, 1, 0. |
| 5 | 0 | [0, 2, 1] | - | Hit on 0. Reorder: Access 0 moves it to front. New order: 0 (newest), 2, 1 (oldest). |
| 6 | 3 | [3, 0, 2] | F | Need a frame. LRU page is 1. Evict 1, load 3. New order: 3, 0, 2. |
| 7 | 0 | [0, 3, 2] | - | Hit on 0. Reorder: 0 moves to front. Order: 0, 3, 2. |
| 8 | 4 | [4, 0, 3] | F | Need a frame. LRU page is 2. Evict 2, load 4. New order: 4, 0, 3. |
| 9 | 2 | [2, 4, 0] | F | Need a frame. LRU page is 3. Evict 3, load 2. New order: 2, 4, 0. |
| ... | ... | ... | ... | ... (Continue as per Figure 10.15) |
Critical juncture at step 8 (ref 4): According to Figure 10.15, just before this fault the frames held pages 2, 0, and 3. Of these, page 2 was used least recently: it was loaded at step 4 and not touched since, while 0 was used at steps 5 and 7, and 3 at step 6. So LRU evicts page 2, not knowing it will be needed again immediately at step 9. This shows LRU's fallibility compared to OPT.
Total Page Faults with LRU and 3 frames = 12.
Performance: LRU (12 faults) is much better than FIFO (15) though not as good as OPT (9). It's a good practical approximation.
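Offline, LRU is easy to simulate using an ordered map as the recency "stack" (a sketch; a real kernel cannot afford this per-reference bookkeeping, as discussed next):

```python
from collections import OrderedDict

def lru_faults(refs, num_frames):
    """LRU: an OrderedDict keeps pages ordered from least to most recent."""
    recency, faults = OrderedDict(), 0
    for page in refs:
        if page in recency:
            recency.move_to_end(page)      # hit: page becomes most recently used
            continue
        faults += 1
        if len(recency) == num_frames:
            recency.popitem(last=False)    # evict the least recently used page
        recency[page] = True
    return faults

refs = [7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1]
print(lru_faults(refs, 3))  # 12
```

Unlike the FIFO simulator, every hit reorders the structure; that per-reference update is precisely the implementation burden described below.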
Major Challenge: Implementing LRU¶
The LRU policy is excellent, but implementing it efficiently requires significant hardware support. We need to track the order of every single memory reference.
Two Theoretical Implementation Methods:
Counters (Time-of-Use Stamp):
- Each page-table entry has a time-of-use field.
- The CPU has a logical clock/counter that increments on every memory reference.
- On each memory access (read or write), the current clock value is copied into the time-of-use field for the referenced page.
- On replacement: The OS must scan all page-table entries in memory to find the page with the smallest time stamp (least recently used).
- Overheads:
- A write to the page table in memory on every access (very expensive).
- A full table scan on every page fault.
- Clock overflow must be handled.
Stack (Doubly Linked List):
- Maintain a stack of page numbers (not a strict software LIFO stack, but an ordered list).
- On a page reference: the page is removed from its current position in the stack and placed on the top (made most recent).
- Structure: Best implemented as a doubly linked list with head (most recent) and tail (least recent) pointers.
- Operation: Moving a page from the middle to the top requires updating up to six pointers, but it is a constant-time (O(1)) operation.
- Replacement: The tail pointer always points to the LRU page—instant identification, no search needed.
- This method is suitable for microcode or firmware implementation.
Crucial Hardware Requirement: Both methods must update their data structures on every memory reference. Doing this via software interrupts would slow down each memory access by an order of magnitude (e.g., 10x), which is unacceptable. Therefore, direct hardware support (like dedicated registers, counters, or microcode) is essential, making pure LRU rare in its exact form.
Stack Algorithms and Belady's Anomaly¶
LRU shares a vital property with OPT: it does not suffer from Belady's Anomaly.
- Reason: Both are stack algorithms.
- Stack Algorithm Property: The set of pages in memory for `n` frames is always a subset of the set of pages that would be in memory with `n+1` frames.
- For LRU: The pages in memory are the `n` most recently referenced pages. If you add a frame (`n+1`), you simply keep one more of the most recent pages. You never lose a page you had before.
- Consequence: With a stack algorithm, increasing the number of frames never increases the page-fault rate. The fault curve (like Fig. 10.11) is always non-increasing.
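The contrast with FIFO can be checked empirically. A small sketch (the `count_faults` helper is illustrative, not from the text) that runs both policies on the classic Belady reference string:

```python
def count_faults(refs, num_frames, policy):
    """Page faults for FIFO or LRU. 'mem' is ordered with index 0 as the
    next victim; LRU refreshes a page's position on every hit, while
    FIFO keeps pure arrival order."""
    mem, faults = [], 0
    for page in refs:
        if page in mem:
            if policy == "lru":
                mem.remove(page)
                mem.append(page)     # move to most-recent end
        else:
            faults += 1
            if len(mem) == num_frames:
                mem.pop(0)           # evict oldest / least recent
            mem.append(page)
    return faults

belady = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print([count_faults(belady, n, "fifo") for n in (3, 4)])  # [9, 10]: Belady's anomaly
print([count_faults(belady, n, "lru") for n in (3, 4)])   # [10, 8]: non-increasing
```

Adding a frame makes FIFO worse on this string (9 faults to 10), while LRU, being a stack algorithm, can only improve.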
10.4.5 LRU-Approximation Page Replacement¶
The Reality: Limited Hardware Support¶
Few computer systems provide the full hardware needed for exact LRU (counters or stack updating on every memory access). Many systems provide only minimal support, forcing the use of simpler algorithms like FIFO. However, a common middle ground is the reference bit.
- What is a Reference Bit? A single bit stored in each page table entry (or in a separate hardware array).
- Hardware Action: The memory management unit (MMU) hardware automatically sets this bit to 1 whenever the corresponding page is referenced (read from or written to).
- Initial State: The operating system clears all reference bits to 0 periodically.
- Information Gained: The reference bit tells us whether a page has been used recently (1) or not (0), but it provides no information about order—we can't tell which of two pages with bit=1 was used more recently.
This limited data is the foundation for algorithms that approximate LRU behavior.
10.4.5.1 Additional-Reference-Bits Algorithm¶
To get a crude sense of usage history over time, we can extend the single reference bit into a history of reference bits.
How it works:
- Data Structure: For each page, the OS maintains an 8-bit byte (or similar length) in a table in memory. Think of it as a shift register.
- Timer Interrupt: At regular intervals (e.g., every 100 milliseconds), a timer interrupt gives control to the OS.
- Recording History: For each page, the OS performs a shift operation on its 8-bit register:
- The current hardware reference bit (1 or 0) is shifted into the leftmost (high-order) bit of the register.
- All existing bits are shifted one position to the right.
- The rightmost (low-order) bit is discarded.
- Interpretation: After many intervals, this 8-bit register holds a history of the reference bits for the last 8 time periods.
- Example Register Values:
  - `00000000`: Page not referenced in the last 8 periods. Very cold.
  - `11111111`: Page referenced in every recent period. Very hot.
  - `11000100`: Page used recently (leftmost bits = 1) and also some time ago, but with gaps. Warmer than `01110111` because its most recent history (left side) shows more recent use.
Selecting a Victim for Replacement:
- Treat each 8-bit register as an unsigned integer (0 to 255).
- The page with the smallest numerical value is considered the least recently used (it has the fewest recent references).
- Tie-Breaking: If multiple pages have the same lowest value, the OS can either replace all of them or use a secondary rule (like FIFO) to pick one.
Trade-Off:
- More bits provide a longer, more detailed history but increase per-page memory overhead and update time.
- Fewer bits make the update faster but give a less accurate history.
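The shift-register update is easy to sketch in software. A minimal simulation, assuming an 8-bit history byte and made-up per-tick reference bits for three hypothetical pages:

```python
def shift_history(history, ref_bit, width=8):
    """One timer tick: shift the hardware reference bit into the
    high-order end of the history byte; the low-order bit falls off."""
    return ((history >> 1) | (ref_bit << (width - 1))) & ((1 << width) - 1)

# Hypothetical pages with their reference bits over four timer intervals.
pages = {"A": [1, 1, 1, 1], "B": [0, 0, 0, 0], "C": [1, 0, 0, 1]}
history = {p: 0 for p in pages}
for tick in range(4):
    for p, bits in pages.items():
        history[p] = shift_history(history[p], bits[tick])

for p in pages:
    print(p, format(history[p], "08b"))   # A 11110000, B 00000000, C 10010000
# Victim = the page whose history byte is numerically smallest:
print(min(history, key=history.get))      # B
```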
The Extreme Case (no history bits): If we shrink the history register to zero bits, we are left with only the current reference bit itself. This special case is so important it has its own name: the Second-Chance page-replacement algorithm, which we will cover next. It's the simplest practical LRU approximation.
10.4.5.2 Second-Chance Algorithm (The Clock Algorithm)¶
This is a practical and widely-used approximation of LRU that requires only a single reference bit per page.
Basic Idea: FIFO with a Second Chance¶
The foundation is a FIFO queue, but we use the reference bit to grant a reprieve.
- FIFO Selection: Start by considering the page at the head of the FIFO queue (the oldest page).
- Inspect Reference Bit:
- If Reference Bit = 0: This page hasn't been used recently. It gets no second chance. Evict it immediately.
- If Reference Bit = 1: This page has been used since the last check. Give it a second chance:
- Clear its reference bit to 0.
- Move it to the tail of the queue (updating its "arrival time" to now).
- Repeat: Move to the next page in the queue and repeat step 2 until you find a page with reference bit = 0 to evict.
Important Behaviors:
- A page that is continuously used will keep its reference bit set to 1 and will never be evicted (it always gets a second chance).
- If all pages have reference bit = 1, the algorithm will make a full cycle, clearing all bits, and then replace the original head page (which now has bit=0). In this full-cycle case, it degenerates to pure FIFO.
Circular Queue Implementation (The Clock Analogy)¶
The algorithm is best visualized and implemented as a circular queue (a "clock" face with a moving hand).
- Data Structure: All pages in memory are arranged in a circular list. A clock hand (pointer) points to the next candidate for inspection.
- Procedure (Follow Figure 10.17):
- (a) Initial State: Hand points to a page. We need a free frame.
- Scan: Check the reference bit of the page under the hand.
- If bit = 1: Clear the bit to 0 and advance the hand to the next page.
- If bit = 0: This is the victim. Select this page for replacement.
- (b) After Scanning: The hand has advanced, clearing bits (setting them to 0) as it passed pages with bit=1.
- Replacement: Replace the victim page, insert the new page into the victim's position in the circular list, and advance the hand to the next position.
Why "Clock"? The hand sweeps around the circular list like a clock's hand, giving pages a "second chance" if they've been used (bit=1).
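The sweep above can be sketched as a small simulation (page numbers are illustrative; setting the reference bit when a page is loaded models the access that caused the fault):

```python
class ClockReplacer:
    """Second-chance (clock) replacement over a fixed number of frames.

    Each slot holds [page, ref_bit]; the 'hand' sweeps the circular
    list, clearing reference bits until it lands on a page with bit 0.
    """
    def __init__(self, num_frames):
        self.frames = [None] * num_frames
        self.hand = 0

    def access(self, page):
        """Reference a page; return True if it caused a page fault."""
        for entry in self.frames:
            if entry is not None and entry[0] == page:
                entry[1] = 1                  # hardware sets the ref bit
                return False
        while True:                           # fault: sweep for a victim
            entry = self.frames[self.hand]
            if entry is None or entry[1] == 0:
                self.frames[self.hand] = [page, 1]  # load; the faulting access sets the bit
                self.hand = (self.hand + 1) % len(self.frames)
                return True
            entry[1] = 0                      # used recently: second chance
            self.hand = (self.hand + 1) % len(self.frames)

clock = ClockReplacer(3)
faults = sum(clock.access(p) for p in [1, 2, 3, 1, 4, 5])
print(faults)  # 5: only the repeated access to page 1 hits
```

Note how the fault on page 4 clears all three reference bits before evicting page 1's frame: with every bit set, the algorithm degenerates to FIFO, exactly as described above.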
10.4.5.3 Enhanced Second-Chance Algorithm¶
This is a significant improvement that considers both usage and modification to minimize costly disk writes.
We now use two bits per page:
- Reference Bit (`R`): As before (1 = used recently, 0 = not).
- Modify (Dirty) Bit (`M`): (1 = page has been modified, 0 = clean).
This creates four ordered classes (R, M):
| Class | (R, M) | Meaning | Priority to Replace |
|---|---|---|---|
| 1 | (0, 0) | Not recently used, not modified (clean and cold). Best victim. | Highest (Best) |
| 2 | (0, 1) | Not recently used, but modified (dirty but cold). Good victim, but requires a write-back (page-out). | Second Best |
| 3 | (1, 0) | Recently used, clean (hot but clean). Likely to be used again soon. | Third |
| 4 | (1, 1) | Recently used, modified (hot and dirty). Very likely to be used again and requires write-back. | Lowest (Worst) |
The Enhanced Algorithm (Multi-Pass Clock)¶
The algorithm works like the clock algorithm but makes multiple passes looking for the best class.
- Start with the clock hand at its current position.
- Scan the circular list. For each page, examine its `(R, M)` class. The goal is to find the first page in the lowest non-empty class.
- First Pass: Look for a `(0, 0)` page (ideal victim). If found, replace it.
- Second Pass (if needed): If no `(0, 0)` exists, make another scan looking for a `(0, 1)` page. During this scan, do NOT clear the R bit. We are only looking for pages that have remained unreferenced (R=0) since the first pass. If found, replace it (but it must be written to disk first).
- Third/Fourth Pass: If necessary, continue scanning for `(1, 0)` and finally `(1, 1)`. If all pages are `(1, 1)`, the algorithm will eventually select one after a full cycle (effectively becoming FIFO among dirty pages).
Key Advantage: This algorithm actively prefers clean pages over dirty ones to avoid the expensive disk write operation, while still respecting recent usage. It's a very effective balance and is used in real systems (a variant is known as the Not Recently Used (NRU) algorithm).
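The four-class ordering can be sketched as follows. This is a simplified selection over a static snapshot (field names like `ref`/`dirty` are made up for illustration; a real implementation also keeps a moving clock hand and manipulates reference bits between passes):

```python
def choose_victim(frames):
    """Multi-pass victim selection: scan once per (R, M) class, best
    class first, and take the first page found in the lowest
    non-empty class."""
    for target in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        for f in frames:
            if (f["ref"], f["dirty"]) == target:
                return f["page"]
    return None

snapshot = [
    {"page": "A", "ref": 1, "dirty": 1},  # hot and dirty: worst choice
    {"page": "B", "ref": 0, "dirty": 1},  # cold but dirty: needs write-back
    {"page": "C", "ref": 1, "dirty": 0},  # hot but clean
]
print(choose_victim(snapshot))  # 'B': (0, 1) is the lowest non-empty class
```

Here the algorithm accepts the write-back cost of B rather than evict the recently used (and clean) page C, showing the clean-versus-recency trade-off in action.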
10.4.6 Counting-Based Page Replacement¶
These algorithms try to make decisions based on the frequency of page accesses rather than (or in addition to) recency.
Least Frequently Used (LFU)¶
- Rule: Replace the page that has been used the least number of times (has the smallest reference count).
- Rationale: A page that is accessed frequently is important and should stay in memory.
- Major Problem: Staleness. A page might be used intensively during an initialization phase (giving it a high count) and then never used again. LFU will keep this no-longer-needed page in memory because of its historically high count, while evicting a newer, more actively used page with a lower count.
- Partial Solution: Aging. To make counts reflect recent usage, periodically shift all counts right by 1 bit (divide by 2). This creates an exponentially decaying average. Older references fade away, giving more weight to recent activity.
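The aging idea can be sketched numerically (page names, counts, and the per-interval increment are all hypothetical):

```python
def age_counts(counts):
    """Aging pass for LFU: halve every reference count so old references
    decay exponentially and recent activity dominates."""
    return {page: c >> 1 for page, c in counts.items()}

# Hypothetical pages: 'init' was hot during startup, 'work' is hot now.
counts = {"init": 100, "work": 4}
for _ in range(4):          # four aging intervals pass...
    counts = age_counts(counts)
    counts["work"] += 10    # ...while only 'work' keeps being referenced
print(counts)               # {'init': 6, 'work': 19}
```

Without aging, plain LFU would evict `work` (count 4 vs. 100); after four aging passes the stale `init` page correctly becomes the LFU victim.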
Most Frequently Used (MFU)¶
- Rule: Replace the page with the largest reference count.
- Counterintuitive Rationale: The argument is that the page with the smallest count was probably just brought in and hasn't had a chance to be used yet, so it might be needed soon. Conversely, a page with a very high count has already been used heavily and might not be needed immediately.
- This logic is generally considered flawed for typical program behavior (locality suggests recently used pages will be used again).
Verdict on Counting Algorithms: Neither LFU nor MFU is common in practice.
- Implementation Cost: Requires maintaining and updating counters for every page access, which is expensive.
- Poor Approximation of OPT: They don't model the "future need" principle of OPT as effectively as LRU-based approximations do.
10.4.7 Page-Buffering Algorithms¶
These are supplemental techniques used to improve the performance of any core page-replacement algorithm (like FIFO, LRU, or Clock). They focus on optimizing I/O and reducing the latency felt by the faulting process.
1. Free-Frame Pool with Immediate Restart¶
- Standard Problem: If the victim page is dirty, the process must wait for the page-out (write) to complete before the needed page can be paged in (read). This sequential I/O doubles the wait time.
- Buffering Solution: Always maintain a pool of free frames.
- Procedure on a page fault:
- Select a victim page using your main algorithm (LRU, Clock, etc.).
- Immediately take a free frame from the pool and initiate the page-in of the desired page into this free frame.
- In parallel, write the dirty victim page out to disk (page-out).
- Once the page-in completes, the process can restart immediately.
- When the page-out completes, the victim's frame is added back to the free pool.
- Benefit: The process only waits for the page-in time. The page-out I/O happens concurrently in the background. This hides the write latency.
2. List of Modified Pages (Clustering of Writes)¶
- Goal: Increase the chance that a selected victim is clean, avoiding the write-back altogether.
- Method: The OS maintains a list of pages that have been modified (dirty).
- Background Daemon: When the paging disk is idle, this daemon selects pages from the dirty list, writes them out to disk in batches, and clears their modify bits (making them clean).
- Benefit: When the page-replacement algorithm later needs a victim, many pages will already be clean and can be instantly overwritten. This also clusters writes (multiple pages written in one efficient disk operation), which is much faster than writing single pages sporadically.
3. Free-Frame Pool with Page Cache¶
- Enhanced Free Pool: Don't just keep free frames empty. Remember which page used to be in each frame in the free pool.
- Why? If a process faults on a page that was recently evicted, and that page's frame is still in the free pool (with its contents intact because it was clean or hasn't been reused), you can reassign it immediately without any disk I/O.
- Procedure on a page fault:
- First, scan the free-frame pool to see if the desired page is already there (a soft fault or reclaim). If found, just re-link it to the process's page table. Zero I/O.
- If not in the pool, then proceed with the normal page-replacement routine (select victim, use a free frame, perform disk I/O).
- Benefit: This acts as a second-chance cache in memory, salvaging incorrect victim choices. It's especially useful with algorithms like the simple Second-Chance/Clock, which can sometimes evict a page that is needed again soon.
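The reclaim check can be sketched as a toy model (the `free_pool` mapping, frame numbers, and the `read_from_disk` callback are all hypothetical stand-ins for kernel structures):

```python
def handle_fault(page, page_table, free_pool, read_from_disk):
    """Resolve a page fault using a free-frame pool that remembers
    which page each pooled frame last held (contents still intact).

    free_pool: dict mapping old page -> frame number.
    read_from_disk: I/O callback simulating a disk read into a frame.
    """
    if page in free_pool:
        # Soft fault / reclaim: the wanted page is still sitting in a
        # pooled frame, so just re-link it. Zero disk I/O.
        page_table[page] = free_pool.pop(page)
        return "reclaimed, no I/O"
    # Hard fault: overwrite some pooled frame with data read from disk.
    _old_page, frame = free_pool.popitem()
    read_from_disk(page, frame)
    page_table[page] = frame
    return "read from disk"

page_table, pool = {}, {"X": 7, "Y": 8}   # frames 7 and 8 still hold X, Y
print(handle_fault("X", page_table, pool, lambda p, f: None))  # reclaimed, no I/O
print(handle_fault("Z", page_table, pool, lambda p, f: None))  # read from disk
```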
Real-World Use: Variants of these buffering techniques are used in many systems. For example, some UNIX versions combine the Second-Chance algorithm with a free-frame cache. Section 10.5.3 will discuss these and other optimizations further.
10.4.8 Applications and Page Replacement¶
The Problem: When OS Generalization Fails¶
The operating system's virtual memory and page-replacement algorithms (like LRU) are designed as general-purpose solutions. They assume typical application behavior (locality of reference). However, for specialized applications with unique, predictable access patterns, this one-size-fits-all approach can hurt performance.
Case Study 1: Database Management Systems (DBMS)¶
- How Databases Work: A DBMS like PostgreSQL or Oracle has its own sophisticated memory manager and I/O buffer pool. It knows exactly which data pages (from tables, indexes) are "hot" or "cold" based on query patterns.
- Double Buffering Problem:
- The DBMS reads a data page from disk into its own application buffer in memory.
- The Operating System, unaware of the DBMS's structures, also takes that data and may store it in the OS page cache (another copy in memory).
- Result: The same disk block exists twice in RAM—once in the DBMS buffer, once in the OS cache. This wastes memory that could be used to cache more data.
- Solution: Many databases use Direct I/O or similar mechanisms to bypass the OS page cache, reading data directly from disk into the application's own buffers, avoiding duplication.
Case Study 2: Data Warehouses & Large Sequential Scans¶
- Typical Workload: Read a massive dataset sequentially (e.g., a multi-gigabyte table), perform analysis/aggregation, then write results.
- Why LRU is Terrible Here:
- LRU treats the most recently used (newest) pages as most important.
- In a sequential scan, old pages (the ones read earlier) are not needed again until the next full scan. The "new" pages are the ones just read.
- LRU will evict the old pages and keep the new ones, which is exactly the opposite of what's needed if you plan to loop back and start the scan over. You want to keep the oldest pages (the start of the dataset) because you'll re-read them soon.
- Irony: In this specific pattern, the MFU (Most Frequently Used) algorithm—which is generally poor—might outperform LRU! Why? The first pages of the dataset are accessed at the start of every scan, giving them a very high frequency count. MFU would keep these high-count pages, which is beneficial for a looping sequential scan.
The OS Escape Hatch: Raw I/O¶
For applications that need full control, some operating systems offer a raw disk interface.
- What it is: A raw disk (or raw partition) is a section of secondary storage presented to the application as a simple, linear array of logical blocks, with no file system structure (no inodes, directories, file names, etc.).
- Raw I/O: The application performs I/O directly to this raw partition. This operation bypasses all OS file services:
- File I/O Demand Paging
- File Locking
- Prefetching
- Space Allocation
- Naming and Directory Lookup
- Benefit: The application has complete control over caching, read-ahead, and data layout. It can implement an access pattern perfectly suited to its task.
- Major Drawback: The application takes on massive complexity—it must implement all the services a file system provides (crash consistency, space management, etc.). This is only viable for highly specialized, complex software like databases.
Key Takeaway¶
There is a fundamental tension between:
- OS Generalization: Providing transparent, convenient, safe virtual memory and file services for all applications.
- Application Specialization: Allowing performance-critical, knowledgeable applications to manage their own memory and I/O for optimal efficiency.
Most applications benefit greatly from the OS's services. But for a critical few, the OS mechanisms can be a bottleneck, necessitating features like Direct I/O or Raw I/O through which the OS relinquishes control to the application.
10.5 Allocation of Frames¶
The Core Question¶
We now move from deciding which page to replace to deciding how many frames to give to each process. If the system has a pool of free frames, how should we distribute them among competing processes?
The Simplest Strategy: Global Allocation from a Free List¶
Consider a system with 128 physical frames. The OS itself might use 35 for kernel code and tables, leaving 93 free frames for user processes.
- Initial State: All 93 user frames are on the free-frame list.
- Process Start: A user process begins and causes page faults.
- Allocation: The first 93 page faults are satisfied by simply taking frames from the free list. No replacement is needed yet.
- After Exhaustion: On the 94th page fault, the free list is empty. Now the page-replacement algorithm must be invoked to select a victim frame from among the 93 frames already allocated to this same process, free it, and then load the new page.
- Process Termination: When the process ends, all its frames are returned to the free-frame list.
Variations on this theme:
- The OS can allocate its own buffers from the free list too, allowing unused OS memory to temporarily support user paging.
- Maintain a small reserve (e.g., 3 frames) on the free list to handle a page fault immediately while a replacement victim is being selected and written out in the background.
Key Point of Simple Strategy: Frames are allocated on-demand from a global pool. The process doesn't have a fixed, guaranteed allocation; it competes for all free frames until memory is full, and then it competes with itself (through its own page replacement).
10.5.1 Minimum Number of Frames¶
We cannot allocate frames arbitrarily. There is both a lower bound (minimum) and an upper bound (maximum) for the frames a process must/can have.
Why a Minimum? Two Reasons:¶
- Performance: Too few frames lead to an excessively high page-fault rate, crippling performance (thrashing, discussed later).
- Correctness: An instruction must be able to complete execution after a page fault is resolved. This requires enough frames to hold all the distinct pages a single instruction might reference.
Architectural Minimum: Ensuring Instructions Can Restart¶
The CPU architecture dictates the absolute minimum number of frames. The requirement stems from the need to restart any instruction after a page fault (as discussed in Section 10.2.1).
Let's build examples:
Case 1: Simple Single-Address Instruction
- Instruction: `LOAD R1, A` (load from memory address `A` into register R1).
- This instruction references:
  - The instruction itself (page `P_instr`).
  - The operand at address `A` (page `P_A`).
- Minimum frames required: 2. One for the code page, one for the data page. If you had only 1 frame, a page fault for either component would evict the other, making restart impossible.
Case 2: Adding One Level of Indirect Addressing
- Instruction: `LOAD R1, (A)` (load from the address pointed to by the contents of memory location `A`).
- This can reference:
  - The instruction (page `P_instr`).
  - The indirect word at address `A` (page `P_A`).
  - The final operand at the address stored in `A` (page `P_final`).
- All three could be on different pages.
- Minimum frames required: 3. Try with 2 frames: if `P_instr` and `P_A` are in memory and a fault occurs for `P_final`, you'd have to evict one of them. If you evict `P_instr`, you cannot restart. If you evict `P_A`, you lose the pointer needed to find the operand. Deadlock.
Case 3: Complex Instructions (e.g., IBM MVC)
- Instructions that modify multiple memory locations (like block moves) may require even more frames to guarantee a safe, restartable state, as their microcode might need to pre-access all source and destination pages.
Case 4: Modern x86-64 Architecture
- Intel/AMD architectures are generally register-to-register or register-to-memory. They don't allow direct memory-to-memory operations in a single instruction. This limits the potential memory references per instruction, keeping the architectural minimum low (often 2).
Summary of Frame Limits¶
- Minimum Number of Frames: Defined by the computer architecture to ensure instruction restartability. It's a hard lower limit.
- Maximum Number of Frames: Defined by the total available physical memory (minus OS needs).
- The Allocation Challenge: The interesting design space for the OS is in the wide range between this minimum and maximum. How do we decide on a fair and efficient allocation within this range? This leads to allocation algorithms.
10.5.2 Allocation Algorithms¶
We need a policy to decide how many frames (`a_i`) to give to each process `P_i`. Let:

- `m` = Total number of available free frames (for user processes).
- `n` = Number of competing processes.
1. Equal Allocation¶
The simplest approach: divide the frames equally among all processes.
Formula: `a_i = floor(m / n)` for each process.

Leftover Frames: Any remainder frames (`m mod n`) are kept as a global free-frame buffer pool.

Example: 93 frames, 5 processes.

- Each gets: `floor(93 / 5) = 18` frames.
- Leftover: `93 mod 5 = 3` frames go to the free pool.
Problem: It ignores process needs. A small 10-page process gets the same allocation as a massive 1000-page process. This is inefficient—the small process wastes unused frames, while the large process may suffer excessive page faults.
2. Proportional Allocation¶
Allocate frames in proportion to each process's size (its virtual memory demand). The idea is to give a larger share to bigger processes.
Definitions:

- `s_i` = Size of process `P_i` (in pages, not bytes).
- `S` = Sum of all process sizes = `Σ s_i`.

Formula: `a_i = (s_i / S) * m`

Adjustments: The result `a_i` must be:

- Rounded to an integer.
- At least the architectural minimum number of frames.
- Summed over all processes, no more than `m`.

Example from text: Two processes.

- `m = 62` frames.
- Process 1: `s_1 = 10` pages.
- Process 2: `s_2 = 127` pages.
- `S = 10 + 127 = 137` pages.
- Allocation:
  - `a_1 = (10 / 137) * 62 ≈ 4.53` → **4 frames**
  - `a_2 = (127 / 137) * 62 ≈ 57.47` → **57 frames**
  - Total: 4 + 57 = 61 frames. The 1 leftover frame goes to a free pool.
Advantage: More "fair" in terms of need. The large process gets more resources to accommodate its working set.
Dynamic Adjustment: In both equal and proportional schemes, the allocation changes when the multiprogramming level (`n`) changes.

- New process arrives: Frames are taken from existing processes (each loses a few) to give to the newcomer.
- Process terminates: Its freed frames are redistributed among the remaining processes.
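The proportional formula is simple to compute. A sketch that reproduces the example from the text, assuming an architectural minimum of 2 frames per process (a real allocator would also verify the total stays within `m`):

```python
import math

def proportional_allocation(sizes, m, minimum=2):
    """Frames for each process, proportional to its size s_i, floored
    at an assumed architectural minimum of 2 frames."""
    S = sum(sizes)
    return [max(minimum, math.floor(s / S * m)) for s in sizes]

# Example from the text: m = 62 frames, processes of 10 and 127 pages.
print(proportional_allocation([10, 127], 62))   # [4, 57]; 1 frame left over
```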
3. Priority-Based Allocation¶
Both equal and proportional allocation treat all processes as equal. But we often have high-priority and low-priority processes.
- Goal: Give more memory to high-priority processes to speed up their execution, even if it means slowing down low-priority ones.
- How: Modify the proportional allocation formula to use priority as a weight instead of (or in addition to) size.
  - Simple Priority Scheme: `a_i` is proportional to the process's priority value.
  - Combined Scheme: `a_i` is proportional to `(s_i * priority_i)` or a similar combined metric.
- Effect: A small, high-priority process could receive more frames than a large, low-priority one, ensuring it runs with fewer page faults and completes faster.
10.5.3 Global versus Local Allocation¶
This section addresses a critical design choice: When a process needs a free frame, where can it look for a victim page?
Local Replacement¶
- Definition: Each process can only select a victim page from among the frames currently allocated to itself.
- Consequence: The number of frames allocated to a process (`a_i`) remains constant. The process competes only with itself.
- Advantage:
- Predictable Performance: A process's page-fault rate depends only on its own behavior. It is insulated from the memory pressure caused by other processes.
- Isolation: One greedy process cannot starve others by stealing their frames.
- Disadvantage:
- Potential Inefficiency: Memory can become fragmented. A process with low memory activity cannot lend its unused frames to a process that desperately needs them. Overall system throughput may suffer.
Global Replacement¶
- Definition: A process can select a victim page from the set of all frames in the system, regardless of which process owns them.
- Consequence: The number of frames allocated to a process is dynamic. A process can gain frames by taking them from others, and lose frames if others take from it.
- Example with Priorities: A system could implement a policy where a high-priority process can take frames from low-priority processes, but not vice-versa. This directly enforces priority-based allocation.
- Advantage:
- Higher System Throughput: Memory is a global resource that flows to the processes that are most active at any given moment. This generally leads to better overall utilization and performance.
- Disadvantage:
- Unpredictable Performance: A process's performance becomes dependent on the "paging behavior of other processes". The same program could run fast when the system is idle and very slow when memory is under heavy contention from other processes.
Verdict: Global replacement is more common in practice because it achieves higher overall throughput, which is usually the system-wide goal.
Implementing Global Replacement: The Reaper Daemon with Thresholds¶
A common practical strategy for global replacement doesn't wait for the free-frame list to hit zero. Instead, it uses background reclamation based on watermark thresholds.
The Mechanism (Follow Figure 10.18):
- Two Thresholds:
- Minimum Threshold (Low-Water Mark): The free memory level below which the system is considered under memory pressure.
- Maximum Threshold (High-Water Mark): The target free memory level to reach after reclamation.
- Reaper Daemon (Pageout Daemon): A kernel routine that runs periodically or when triggered.
- Operation Cycle:
- Normal Operation (a->b): At point a, free memory falls below the minimum threshold. This triggers the reaper daemon.
- Reclamation Phase: The daemon aggressively scans all user processes (excluding kernel) using a page-replacement algorithm (often a clock/LRU approximation) to select victim pages. It writes out dirty victims and adds their frames to the free list.
- Suspension (b): Reclamation continues until free memory reaches the maximum threshold at point b. The daemon then suspends.
- Repeat (c->d): As processes run, free memory drains again. At point c, it drops below the minimum, and the cycle repeats (c->d).
Escalation Under Severe Memory Pressure¶
What if the reaper can't keep up and free memory keeps falling dangerously low?
- Tactic 1: More Aggressive Reclamation: Switch the reaper algorithm from a polite one (like Second-Chance) to a more aggressive one (like pure FIFO), reclaiming pages faster.
- Tactic 2: The Nuclear Option - OOM Killer (Linux Example): In extreme cases, the OS may terminate a process to free all its memory instantly.
- How Linux Chooses a Victim Process: Each process has an OOM score. The score increases with:
- Percentage of total memory used (biggest factor).
- Process priority (lower priority → higher score).
- Other heuristics (e.g., long-running vs. new).
- The OOM Killer selects a high-scoring process and kills it. You can view a process's score in `/proc/<pid>/oom_score`.
Configurability¶
- The threshold values (min/max) can often be tuned by a system administrator based on total physical memory and workload characteristics.
- The aggressiveness of the reaper and the OOM killer's policies can also be adjusted.
This hybrid approach—background reclamation with thresholds—provides the benefits of global replacement (high throughput) while maintaining a buffer of free memory to handle sudden demands and allowing controlled escalation under extreme pressure.
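The watermark logic above can be sketched as a toy loop (the threshold values and the `reclaim_one` callback are hypothetical; a real pageout daemon runs asynchronously in the kernel and performs actual page-outs):

```python
def reaper_step(free_frames, low_water, high_water, reclaim_one):
    """One wake-up of a simplified pageout (reaper) daemon.

    If free memory has fallen below the low-water mark, keep reclaiming
    victim frames (reclaim_one() frees one, e.g. by one clock-hand step)
    until the high-water mark is reached; otherwise go back to sleep.
    """
    if free_frames >= low_water:
        return free_frames            # no memory pressure: nothing to do
    while free_frames < high_water:
        reclaim_one()
        free_frames += 1
    return free_frames

# Below the low-water mark: reclaim up to the high-water mark.
print(reaper_step(5, low_water=10, high_water=40, reclaim_one=lambda: None))   # 40
# Above it: the daemon stays idle.
print(reaper_step(30, low_water=10, high_water=40, reclaim_one=lambda: None))  # 30
```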
Extra: Major and Minor Page Faults¶
The Distinction¶
While all page faults occur when a process accesses a page without a valid mapping, operating systems classify them based on the work required to resolve them. This distinction helps in performance monitoring and analysis.
Major Page Fault (Hard Fault - Windows):
- When it happens: The desired page is not in physical memory (RAM) at all. It resides only on the backing store (disk/swap).
- Resolution Cost: High. Requires a physical disk I/O operation:
- Find a free frame (may involve page replacement).
- Read the page from disk into the frame.
- Update the page table.
- Associated With: True demand paging. A high rate of major faults indicates the process is actively bringing its working set into memory or is under memory pressure (thrashing).
Minor Page Fault (Soft Fault - Windows):
- When it happens: The desired page is already in physical memory, but the process's page table lacks a valid mapping to it. No disk I/O is needed.
- Causes:
  - Shared Libraries/Code: The page (e.g., from `libc.so`) is already in RAM because another process loaded it. The faulting process just needs its page table entry updated to point to the existing physical frame.
  - Reclaimed but Not Reused Page: The page was recently reclaimed from this same process (or another) and placed on the free-frame list, but its content hasn't been zeroed out or overwritten yet. The OS can simply reassign the same frame back to the process, often without even clearing it if it's from the same process.
- Resolution Cost: Low. Involves only updating page table entries in memory. Much faster than a major fault.
Why This Matters: Performance Insight¶
The ratio of minor to major faults is a key performance indicator.
- High Minor, Low Major Faults (Ideal): This pattern, as seen in the Linux `ps` output, indicates:
  - The system is efficiently using shared memory (libraries). Once a common library is loaded by one process, others fault on it minimally.
  - The process's working set is largely resident in memory. Most faults are just linking to already-present pages.
- High Major Faults: Indicates the process is actively reading new data from disk (startup phase) or is thrashing (constantly swapping pages in and out due to insufficient memory).
Linux Example: ps -eo min_flt,maj_flt,cmd¶
The command output shows:
MINFL MAJFL CMD
186509 32 /usr/lib/systemd/systemd-logind
76822 9 /usr/sbin/sshd -D
1937 0 vim 10.tex
699 14 /sbin/auditd -n
- Observation: For long-running daemons (`systemd-logind`, `sshd`), minor faults (MINFL) are orders of magnitude higher than major faults (MAJFL).
- Interpretation: These processes caused many minor faults as they mapped shared libraries into their address space over time. Their major fault count is low and static, meaning they are not swapping; their necessary pages are all in memory.
- The `vim` process shows 0 major faults, suggesting the file it's editing (`10.tex`) was already cached in memory, or its working set is tiny and fully resident.
Practical Takeaway: Monitoring major/minor faults helps diagnose memory performance issues. A process with a sustained high rate of major faults is likely suffering from memory contention or has a working set larger than available RAM.
10.5.4 Non-Uniform Memory Access (NUMA)¶
The End of the Uniform Memory Assumption¶
So far, we've assumed accessing any part of physical RAM takes the same time. This is Uniform Memory Access (UMA). In modern multi-CPU servers, this is often not true.
What is NUMA?
- Architecture: The system has multiple CPUs (or CPU sockets), each with its own local bank of physical memory (see Figure 10.19). All CPUs are connected via a shared system interconnect (like a high-speed bus or a mesh).
- Access Cost Disparity:
- Local Access (Fast): A CPU accessing its own local memory has low latency.
- Remote Access (Slow): A CPU accessing memory attached to another CPU must go through the shared interconnect, incurring higher latency.
- Trade-off: NUMA systems are slower on average than UMA for memory access, but they allow scaling to many more CPUs and much more total RAM by avoiding a single, monolithic memory bus bottleneck.
The Memory Allocation Challenge in NUMA¶
Standard memory allocation and page replacement algorithms treat all frames as equal. In NUMA, this is suboptimal.
- Bad Allocation: If a process running on CPU0 gets a page fault, and the OS allocates a frame from Memory3 (remote), every subsequent access to that page will suffer remote access latency, hurting performance.
- Goal: Allocate memory "close" to the CPU that will use it. "Close" means minimum latency, typically the memory bank on the same system board or NUMA node as the CPU.
NUMA-Aware Operating System Strategies¶
The OS must coordinate the scheduler and the memory allocator.
1. CPU Affinity + Local Allocation:
- Scheduler: Tries to keep a process/thread scheduled on the same CPU (or NUMA node) it ran on last. This is called CPU affinity.
- Memory Allocator (Page Fault Handler): When a page fault occurs, it allocates a frame from the free-frame list of the NUMA node where the faulting process is currently running.
- Benefit: Maximizes local accesses, improving cache hits and lowering memory latency.
2. The Complexity of Multi-threaded Processes: A single process with many threads can have its threads scheduled across different NUMA nodes. Where should its memory be allocated?
- Problem: If threads on Node 0 access memory allocated on Node 1, they suffer remote access penalties.
OS Solutions to this Problem:
- Linux Approach:
- Scheduling Domains: The kernel organizes CPUs into a hierarchy of scheduling domains representing the NUMA topology. The Completely Fair Scheduler (CFS) restricts thread migration to stay within a domain, preventing threads from wandering too far from their memory.
- Per-Node Free Lists: Linux maintains separate free-frame lists for each NUMA node. When a thread faults, it allocates from the list of the node it's running on.
- Solaris Approach:
- Locality Groups (lgroups): The kernel groups CPUs and memory into lgroups based on access latency. Each lgroup defines a set of resources with uniform, fast access.
- Hierarchy: Lgroups are arranged in a tree hierarchy based on increasing latency between them.
- Allocation Policy: The system tries to schedule all threads of a process and allocate all memory for that process within a single lgroup. If impossible, it uses the nearest possible lgroups to minimize cross-lgroup latency.
Summary: NUMA awareness requires the OS to:
- Understand the hardware topology (which CPUs and memory are close).
- Make scheduling decisions that keep threads near their memory.
- Make allocation decisions that place pages in memory local to the accessing CPU. This coordination is crucial for achieving high performance on large-scale multiprocessor systems.
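The per-node free-list strategy with nearest-node fallback can be sketched as a toy allocator. This is a minimal sketch: the two-node topology, the distance table, and the frame names are invented for illustration; a real kernel discovers the topology from firmware tables.

```python
# Toy NUMA-aware frame allocator: one free-frame list per node, with
# fallback to the nearest node that still has frames. All values are
# hypothetical, not a real machine's topology.

NODE_DISTANCE = {            # latency "distance" between NUMA nodes
    (0, 0): 10, (0, 1): 20,
    (1, 0): 20, (1, 1): 10,
}

class NumaAllocator:
    def __init__(self, frames_per_node):
        # Separate free-frame list per NUMA node, as the text describes.
        self.free = {node: list(frames) for node, frames in frames_per_node.items()}

    def alloc(self, running_node):
        # Prefer the node the faulting thread is running on (local allocation).
        if self.free[running_node]:
            return running_node, self.free[running_node].pop()
        # Otherwise fall back to the nearest node with free frames.
        candidates = [n for n in self.free if self.free[n]]
        if not candidates:
            raise MemoryError("no free frames on any node")
        nearest = min(candidates, key=lambda n: NODE_DISTANCE[(running_node, n)])
        return nearest, self.free[nearest].pop()

alloc = NumaAllocator({0: ["f0", "f1"], 1: ["f2", "f3"]})
print(alloc.alloc(0))   # local allocation from node 0
print(alloc.alloc(0))   # still local
print(alloc.alloc(0))   # node 0 exhausted -> remote allocation from node 1
```

The fallback step is where the remote-access penalty of NUMA becomes visible: the frame is usable, but every access from node 0 to it pays the higher interconnect latency.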
10.6 Thrashing¶
Definition and Root Cause¶
Thrashing is a severe performance degradation where a process (or the entire system) spends more time swapping pages in and out of memory than doing useful work.
- Root Cause: A process does not have "enough" frames to hold its current working set—the set of pages it is actively and repeatedly using.
- The Vicious Cycle:
- The process needs a page not in memory → Page Fault.
- To free a frame, it must evict a page.
- But every page in memory is currently in active use (part of the working set).
- It evicts a page that will be needed again almost immediately.
- Shortly after, the process accesses that evicted page → Another Page Fault.
- The cycle repeats endlessly.
- Result: The process is trapped in a continuous storm of page faults. CPU utilization plummets because the process is constantly blocked waiting for disk I/O. The system spends its time moving pages instead of running code.
10.6.1 Cause of Thrashing¶
The Classic Thrashing Scenario¶
This describes a positive feedback loop of system collapse that can happen with global page replacement and naive multiprogramming control.
Step-by-Step Breakdown of the Disaster:
- Initial State: The OS monitors CPU utilization. If it's low, the OS assumes the CPU is idle and increases the degree of multiprogramming by adding a new process.
- Process Enters a New Phase: An existing process starts a new phase (e.g., a new loop, a new function) and needs more frames for its new working set.
- Fault and Steal: It begins page-faulting. Using a global page-replacement algorithm, it takes frames from other processes.
- Victim Processes React: Those other processes now need their stolen pages back. They also start page-faulting, taking frames from yet other processes.
- Disk Queue Backup: All these faulting processes generate I/O requests for the paging device (disk). A long I/O queue forms.
- CPU Idles: Processes block waiting for their page-in I/O. The ready queue empties. CPU utilization drops.
- OS Misdiagnosis: The OS scheduler, seeing the low CPU utilization, thinks the system is underloaded. It increases the degree of multiprogramming further, adding yet another process.
- New Process Aggravates: The new process also needs frames, causing more page faults and a longer disk queue.
- Complete Thrashing: CPU utilization plummets even more. The system is now thrashing: processes are constantly in the disk I/O queue, and almost no productive work is done. Throughput collapses.
The CPU Utilization Curve (Go to Figure 10.20)¶
This classic graph shows the relationship:
- X-axis: Degree of multiprogramming (number of processes).
- Y-axis: CPU utilization.
- The Curve:
- Initially, as you add processes, CPU utilization increases because some process is usually ready to run while another is blocked on I/O.
- It reaches an optimal point (peak).
- Beyond this point, adding more processes causes memory over-commitment. The total working set of all processes exceeds physical memory.
- Thrashing begins. CPU utilization drops sharply because the CPU is idle waiting for I/O.
- Solution: To recover, you must decrease the degree of multiprogramming (remove processes).
Can Local Replacement Help?¶
Using a local replacement algorithm can limit the spread of thrashing.
- How: If a process begins thrashing, it can only evict its own pages, not steal from others. This contains the damage to that single process.
- Limitation: It doesn't solve thrashing. The thrashing process still hogs the paging device, causing long I/O queues. This increases the average page-fault service time for every process, slowing down the entire system.
The Fundamental Solution: Provide Enough Frames¶
To prevent thrashing, a process must be given enough frames to hold its current working set. But how do we know what that is?
The Locality Model¶
This is a crucial concept for understanding program behavior and memory needs.
- Definition: A locality is a set of pages that are actively used together during a phase of execution. It could be a loop body, a function and its local variables, or a data structure being processed.
- Program Execution: A process executes by moving from one locality to another over time (see Figure 10.21). Localities can overlap (share some pages).
- Basis for Caching: The entire principle of caches (CPU cache, TLB, page cache) relies on this locality of reference. If memory access were perfectly random, caching would be useless.
The Working-Set Principle¶
The working set is the set of pages in the current locality.
- Key Insight: If we allocate enough frames to hold a process's entire current working set, the process will run without page faults within that locality.
- It will only fault when it transitions to a new locality and needs to load a new set of pages.
- If we allocate fewer frames than the working set size: The process cannot keep all actively used pages in memory. It will continuously fault on pages it just evicted—this is the definition of thrashing within a locality.
Figure 10.22 illustrates this with a page reference trace over time windows (Δ). The working set WS(t1) is {1,2,5,6,7} and later changes to WS(t2) = {3,4}. A process needs at least 5 frames during t1 and 2 frames during t2 to avoid thrashing.
Conclusion: The OS must estimate the working set size for each process and allocate at least that many frames. This is the core idea behind the working-set model, which we'll explore next as a strategy for dynamic memory allocation to prevent thrashing.
10.6.2 Working-Set Model¶
Defining the Working Set¶
The working-set model is a practical strategy to estimate a process's current locality and allocate memory accordingly to prevent thrashing.
- The Window (Δ): We define a working-set window of size Δ. This represents the most recent Δ page references (or a time interval of length Δ).
- Working Set (WS): The set of distinct pages referenced during that most recent window is the working set.
- Dynamic Nature: A page enters the working set when it's referenced. It drops out of the working set Δ time units after its last reference within the window (it "ages out").
- Example (Follow Figure 10.22): With Δ = 10 references:
- At time t1, the last 10 references include pages {1, 2, 5, 6, 7}. So WS(t1) = {1, 2, 5, 6, 7}.
- At time t2, the window has slid forward. The recent references are now {3, 4, ...}. So WS(t2) = {3, 4}.
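The sliding-window definition can be expressed directly in code. This is a minimal sketch; the reference string below is invented to mirror the style of Figure 10.22, not taken from it.

```python
def working_set(refs, t, delta):
    """Distinct pages among the last `delta` references ending at index t."""
    window = refs[max(0, t - delta + 1): t + 1]
    return set(window)

# Hypothetical page-reference string with two localities.
refs = [1, 2, 5, 6, 7, 7, 7, 5, 1, 6,   # locality A
        3, 4, 4, 3, 4, 3, 4, 4, 3, 4]   # locality B

print(working_set(refs, 9, 10))    # WS(t1) = {1, 2, 5, 6, 7}
print(working_set(refs, 19, 10))   # WS(t2) = {3, 4}
```

Note how the same Δ = 10 yields a working set of size 5 in the first locality and size 2 in the second, matching the text's point that frame needs change as the process moves between localities.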
Choosing Δ: A Trade-off¶
The accuracy of the working set as a locality measure depends heavily on Δ.
- Δ too small: The window may not capture the entire locality, missing some actively used pages. The working-set size is underestimated.
- Δ too large: The window may span multiple localities, including pages from an old, no-longer-active locality. The working-set size is overestimated.
- Extreme Case (Δ = ∞): The working set becomes all pages ever touched by the process, which is useless for dynamic allocation.
The Core Principle: Demand vs. Supply¶
The most critical output of the model is the Working-Set Size (WSS_i) for each process i.
- Total Demand: D = Σ WSS_i (the sum of all working-set sizes).
- Total Supply: m = the number of available physical frames.
- Thrashing Condition: If D > m (total demand exceeds supply), then thrashing is inevitable. At least one process will not have enough frames for its working set.
- Stable Condition: If D ≤ m, it is theoretically possible to give each process its full working set, eliminating thrashing.
Using the Model for Allocation and Load Control¶
The working-set model provides a policy for dynamic memory allocation and controlling the degree of multiprogramming:
- Monitor: The OS continuously estimates WSS_i for each active process.
- Allocate: Give each process at least WSS_i frames.
- Admit New Processes: If extra frames remain after satisfying all current WSS_i, the OS can start a new process.
- Suspend Processes: If the total demand D exceeds m, the OS must select a process to suspend (swap out entirely). The freed frames are redistributed; the suspended process can be restarted later when memory is available.
- Result: This strategy prevents thrashing by ensuring allocations match actual needs, while maximizing CPU utilization by keeping multiprogramming as high as possible without overcommitting memory.
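The monitor/allocate/suspend policy can be sketched as a simple load-control function. The largest-WSS victim choice below is an illustrative assumption; real systems use more nuanced victim selection.

```python
def load_control(wss, m):
    """Working-set load control: suspend processes until total demand
    D = sum(WSS_i) fits within m frames. Returns (kept, suspended).
    Victim policy (an assumption): suspend the largest working set first."""
    kept = dict(wss)
    suspended = []
    while kept and sum(kept.values()) > m:
        victim = max(kept, key=kept.get)
        suspended.append(victim)
        del kept[victim]
    return kept, suspended

wss = {"P1": 40, "P2": 25, "P3": 50}
kept, suspended = load_control(wss, m=100)
print(kept, suspended)   # D = 115 > 100, so P3 (WSS = 50) is suspended
```

After suspension, D = 65 ≤ m = 100, so the remaining processes can each hold their full working set and thrashing is avoided.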
The Implementation Challenge: Tracking the Working Set¶
The major difficulty is efficiently maintaining the moving window. We need to know, for each page, whether it was referenced in the last Δ accesses.
Approximation with Timer Interrupts and Reference Bits: A practical, approximate method uses the reference bit and periodic interrupts:
- Setup: Let Δ = 10,000 references. Set a timer to interrupt every 5,000 references.
- Operation at each interrupt: For every page, copy the current hardware reference bit into a history slot in software, then clear the hardware bit.
- Result: We maintain, say, 2 history bits representing the last two intervals (the last 10,000 references).
- Determining WS Membership: When considering a page:
- If any of its history bits or its current reference bit is 1, it was referenced in the last ~10,000-15,000 references → It is (likely) in the working set.
- If all bits are 0, it hasn't been used recently → Not in the working set.
- Trade-off: This is inexact because we don't know the exact order within the 5,000-reference interval. We can improve accuracy by using more history bits and shorter intervals, but this increases interrupt overhead and memory for storing bits.
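The interrupt-driven approximation can be simulated per page. This sketch uses 2 history bits as in the setup above; the class and method names are invented for illustration.

```python
class PageHistory:
    """Approximate working-set membership for one page using the hardware
    reference bit plus 2 software history bits, shifted at each timer
    interrupt (a sketch of the scheme described above)."""
    def __init__(self):
        self.ref_bit = 0       # simulated hardware reference bit
        self.history = [0, 0]  # copies taken at the last two interrupts

    def reference(self):
        self.ref_bit = 1       # hardware sets this on any access to the page

    def timer_interrupt(self):
        # Copy the hardware bit into the history, then clear the hardware bit.
        self.history = [self.ref_bit] + self.history[:-1]
        self.ref_bit = 0

    def in_working_set(self):
        # Referenced in the current interval or either recorded interval?
        return self.ref_bit == 1 or any(self.history)

p = PageHistory()
p.reference()
p.timer_interrupt()         # the reference is now recorded in history
print(p.in_working_set())   # True: used within the covered window
p.timer_interrupt()
p.timer_interrupt()         # the reference has aged out of both history bits
print(p.in_working_set())   # False
```

Adding more history bits lengthens the covered window at the cost of more storage and more copying work per interrupt, which is exactly the trade-off noted above.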
Conclusion: The working-set model provides a sound theoretical framework for thrashing prevention. While exact implementation is costly, its principles guide practical OS heuristics (like the Page-Fault Frequency scheme that follows) and explain why systems must monitor memory pressure and adjust process load dynamically.
10.6.3 Page-Fault Frequency (PFF)¶
A Simpler, Direct Approach¶
The working-set model is theoretically sound but complex to implement precisely. The Page-Fault Frequency (PFF) strategy offers a more direct and measurable control mechanism.
Core Insight: Thrashing is directly observable as a very high page-fault rate. Therefore, we can control thrashing by controlling the page-fault rate.
The PFF Algorithm¶
The idea is to maintain the page-fault rate for each process within a desired range.
Establish Bounds (See Figure 10.23):
- Upper Bound: If the page-fault rate exceeds this limit, the process is faulting too often → it needs more frames.
- Lower Bound: If the page-fault rate falls below this limit, the process is faulting very rarely → it may have too many frames (we can take some away for other processes).
Adjustment Rules:
- Fault Rate > Upper Bound: Allocate an additional frame to the process (if available).
- Fault Rate < Lower Bound: Remove a frame from the process (add it to the free pool).
Global Frame Management & Swapping:
- If a process needs a frame (fault rate too high) but no free frames exist, the OS must select a victim process to swap out entirely. The freed frames are then given to processes with critically high fault rates.
- This provides a direct mechanism to reduce the degree of multiprogramming when memory is over-committed.
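The adjustment rules can be written as a small controller. The threshold values below are illustrative, not taken from any real system.

```python
def pff_adjust(fault_rate, frames, free_pool, upper=10.0, lower=2.0):
    """Page-fault-frequency control for one process (thresholds are
    illustrative). Returns (frames, free_pool, need_swap_out)."""
    if fault_rate > upper:
        if free_pool > 0:
            return frames + 1, free_pool - 1, False  # give it another frame
        return frames, free_pool, True               # no free frames: swap someone out
    if fault_rate < lower and frames > 1:
        return frames - 1, free_pool + 1, False      # reclaim a surplus frame
    return frames, free_pool, False                  # rate in range: leave it alone

print(pff_adjust(fault_rate=15.0, frames=8, free_pool=3))  # -> (9, 2, False)
print(pff_adjust(fault_rate=0.5,  frames=8, free_pool=3))  # -> (7, 4, False)
print(pff_adjust(fault_rate=15.0, frames=8, free_pool=0))  # -> (8, 0, True)
```

The third case is the load-control path: a starving process plus an empty free pool signals over-commitment, so the degree of multiprogramming must be reduced.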
Advantages over Working-Set Model¶
- Easier to Measure: The page-fault rate is a straightforward, cheap metric to track (just count interrupts).
- Direct Control: It directly attacks the symptom of thrashing (high fault rate) by adjusting the cause (frame allocation).
Both PFF and the working-set model aim for the same goal: dynamically adjusting frame allocation to match each process's actual needs, thereby preventing thrashing while maximizing multiprogramming.
10.6.4 Current Practice¶
The Modern Solution: Ample Physical Memory¶
While the algorithms (Working-Set, PFF) are important for OS design, the prevailing real-world strategy to avoid thrashing is much more straightforward:
- Provide Enough RAM: The best practice across all systems—from smartphones to servers—is to install sufficient physical memory so that the combined working sets of all active applications and the OS can reside in RAM concurrently.
- Why this works: With enough memory, page faults become rare events that occur mainly at process startup (loading code) or during major phase changes, not from constant contention. Swapping to disk is largely avoided.
- User Experience: This ensures smooth, predictable performance. Thrashing and swapping cause noticeable, unacceptable lag.
The Reality Check¶
- Extreme Conditions: Systems can still experience memory pressure (e.g., running too many massive applications, memory leaks), at which point the OS's page-replacement and reclamation algorithms (Section 10.5.3) become critical.
- Mobile Systems: They often use memory compression (Section 10.7) as a faster alternative to swapping, but the principle remains: try to keep active working sets in fast, uncompressed RAM.
- Conclusion: Sophisticated thrashing-avoidance algorithms are a safety net. The primary line of defense in modern system design is adequate physical memory provisioning. The algorithms ensure graceful degradation when that primary defense is overwhelmed.
10.7 Memory Compression¶
Concept: Compress, Don't Swap¶
Memory compression is a modern alternative to paging out to disk. Instead of writing entire pages to slow secondary storage (swap), the OS compresses multiple in-memory pages into a single frame, freeing up frames for immediate use.
Goal: Reduce memory usage without incurring slow disk I/O.
How It Works (Step-by-Step with Figures)¶
Scenario: The free-frame list is low, triggering page reclamation.
- Select Victims (Figure 10.24): The page-replacement algorithm (e.g., LRU clock) selects victim frames to free (e.g., frames 15, 3, 35, 26). They are placed on a modified-frame list.
- Traditional Path (Swapping): These frames would be written to swap space on disk, then added to the free list. This is slow.
- Compression Path (Figure 10.25):
- Take one frame (frame 7) from the free-frame list to use as a compression target.
- Compress the contents of several victim frames (e.g., 3 frames: 15, 3, 35) into this single frame (7).
- Place this now-packed frame (7) into a new compressed-frame list (a holding area for compressed data).
- The original victim frames (15, 3, 35) are now empty and can be added directly to the free-frame list.
- (Optional) If the fourth victim frame (26) is not compressed, it might be handled separately (maybe swapped if dirty, or just freed if clean).
- Result: We freed 3 frames for immediate reuse, at the cost of 1 frame used for storage, for a net gain of 2 frames. The compressed data stays in RAM.
- Accessing a Compressed Page: If a process later accesses a page that's now compressed (e.g., page in frame 15), a page fault occurs. The OS decompresses the data from the compression frame (7) back into a regular free frame, updates the page table, and resumes execution.
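The packing step can be illustrated with a general-purpose compressor standing in for Xpress or WKdm. The frame numbers follow the figure's example; the page contents are invented and deliberately repetitive (zeroed or pattern-filled pages compress extremely well).

```python
import zlib

PAGE_SIZE = 4096

# Three victim pages (frames 15, 3, 35 in the text's example).
victims = {15: b"A" * PAGE_SIZE, 3: b"B" * PAGE_SIZE, 35: b"\x00" * PAGE_SIZE}

# Compress all three into a single target frame (frame 7 in the example).
compressed = {fid: zlib.compress(data) for fid, data in victims.items()}
packed_size = sum(len(c) for c in compressed.values())
assert packed_size <= PAGE_SIZE   # all three fit in one 4 KB frame
print(f"3 pages ({3 * PAGE_SIZE} B) packed into {packed_size} B: net gain 2 frames")

# A later access to a compressed page faults, then decompresses from RAM --
# no disk I/O is involved.
restored = zlib.decompress(compressed[15])
assert restored == victims[15]
```

Real pages compress less dramatically than these synthetic ones (the text cites 30-50%), but the structure is the same: trade CPU cycles for avoided disk I/O.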
Advantages Over Swapping¶
- Speed: Compression/decompression using the CPU is orders of magnitude faster than disk I/O (even SSD). This makes page-in latency much lower.
- Wear Reduction: On mobile devices with flash storage, avoids frequent writes that wear out the storage.
- Effectiveness: Good compression algorithms can reduce pages to 30-50% of their original size. Compressing 3 pages into 1 frame saves 2 frames, a significant gain.
Widespread Adoption¶
- Mobile OS (Android, iOS): Integral to memory management, as they typically avoid traditional swap.
- Desktop OS:
- macOS (since 10.9): Uses compression first when memory is low, falling back to paging only if compression is insufficient.
- Windows 10: Uses compression, especially for Universal Windows Platform (UWP) apps, providing a common memory management strategy across desktops and mobile devices.
- Performance: Benchmarks show compression outperforms paging even to SSDs in terms of latency.
Design Trade-offs and Algorithms¶
- Compression Ratio vs. Speed: The core trade-off.
- High Ratio (more savings): Requires slower, more complex algorithms.
- Fast Algorithm: May yield lower compression ratios.
- Modern Solutions: Use fast, reasonably efficient algorithms that leverage multiple CPU cores for parallel compression.
- Examples: Microsoft's Xpress, Apple's WKdm. They are designed for speed, targeting ~30-50% compression.
Key Considerations¶
- CPU Overhead: Compression consumes CPU cycles. This is traded for avoided I/O wait.
- Memory Overhead: You must allocate a frame to hold the compressed data. The net gain is the freed frames minus this overhead frame.
- When to Use: Typically triggered before swapping, as a first line of defense against memory pressure. It's a lighter-weight form of "swapping within RAM."
10.8 Allocating Kernel Memory¶
Why Kernel Memory Allocation is Different¶
When a user process requests memory (via malloc() or similar), the kernel satisfies it with whole page frames from the global free-frame list. This leads to internal fragmentation (e.g., a 1-byte request gets a 4KB page). For user processes, this is acceptable due to the simplicity of paging.
The kernel, however, has different requirements and manages its own memory from a separate pool. Two main reasons:
- Small, Variable-Sized Requests: The kernel needs memory for data structures (task structs, inodes, network buffers) that are often smaller than a page (e.g., 256 bytes, 512 bytes). Using whole pages for each would cause severe internal fragmentation, wasting precious kernel memory. The kernel must be extremely memory-efficient.
- Physically Contiguous Memory Requirement: Some hardware devices (DMA controllers, network cards, specialized hardware) perform Direct Memory Access (DMA). They read/write directly to physical RAM addresses, bypassing the MMU and virtual memory. These devices often require blocks of memory that are contiguous in physical address space. User-mode allocations are virtually contiguous but physically scattered; kernel allocations for hardware must guarantee physical contiguity.
We examine two specialized allocators for kernel memory: the Buddy System and Slab Allocation.
10.8.1 Buddy System¶
Purpose and Mechanism¶
The Buddy System manages a large pool of physically contiguous memory (e.g., a chunk of several megabytes). It is a power-of-2 allocator designed to satisfy requests for physically contiguous memory of various sizes while allowing efficient coalescing of freed blocks.
- Unit of Allocation: Memory is allocated in blocks sized as a power of 2 (e.g., 4KB, 8KB, 16KB, ... up to the size of the entire pool).
- Rounding Up: If a kernel request is for a size not a power of 2, it is rounded up to the next higher power of 2. (e.g., a 21KB request becomes 32KB).
- Initial State: The entire pool is a single, large free block (say, 256KB).
Example Allocation (Follow Figure 10.26)¶
Request: 21 KB from a 256 KB pool.
- The 256 KB block is split into two 128 KB "buddies" (AL and AR).
- One 128 KB buddy (AL) is split into two 64 KB buddies (BL and BR).
- Since 21 KB rounds up to 32 KB (the next power of 2), one 64 KB buddy (BL) is split into two 32 KB buddies (CL and CR).
- One of the 32 KB buddies (CL) is allocated to satisfy the request.
- Result: 32 KB is given for a 21 KB request → 11 KB internal fragmentation within the allocated block.
Key Advantage: Fast Coalescing¶
When a block is freed, the allocator checks if its "buddy" (the block it was split from) is also free.
- If the buddy is free, the two are merged (coalesced) back into a single, larger free block of twice the size.
- This merging can recurse up the tree, eventually reconstructing large contiguous segments.
- Example: When CL (32 KB) is freed, its buddy CR is checked. If CR is free, they merge into BL (64 KB). Later, if BL's buddy BR is free, they merge into AL (128 KB), and so on.
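A minimal buddy allocator showing round-up, splitting, and coalescing. The offset-XOR buddy computation is the standard trick (buddies differ in exactly one address bit); this sketch omits the bookkeeping and error handling a real allocator needs.

```python
def next_pow2(n):
    """Round a request up to the next power of 2 (e.g., 21 KB -> 32 KB)."""
    p = 1
    while p < n:
        p *= 2
    return p

class BuddyAllocator:
    def __init__(self, total):
        self.total = total
        self.free = {total: [0]}         # block size -> list of block offsets

    def alloc(self, size):
        size = next_pow2(size)
        s = size                          # find the smallest free block that fits
        while s <= self.total and not self.free.get(s):
            s *= 2
        if s > self.total:
            raise MemoryError
        offset = self.free[s].pop()
        while s > size:                   # split into buddies until it fits
            s //= 2
            self.free.setdefault(s, []).append(offset + s)  # right buddy stays free
        return offset, size

    def free_block(self, offset, size):
        while size < self.total:          # coalesce while the buddy is also free
            buddy = offset ^ size         # buddy differs in exactly one bit
            if buddy in self.free.get(size, []):
                self.free[size].remove(buddy)
                offset = min(offset, buddy)
                size *= 2
            else:
                break
        self.free.setdefault(size, []).append(offset)

kb = 1024
b = BuddyAllocator(256 * kb)
off, got = b.alloc(21 * kb)       # rounded up to 32 KB, as in Figure 10.26
print(got // kb)                  # -> 32
b.free_block(off, got)            # coalesces all the way back up to 256 KB
```

Freeing the single allocation merges CL with CR, then BL with BR, then AL with AR, reconstructing the original 256 KB block — the recursive coalescing described above.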
Major Drawback: Internal Fragmentation¶
Because of rounding up to powers of 2, up to nearly 50% of an allocated block can be wasted.
- Worst case: A request for (2^n) + 1 bytes gets rounded up to 2^(n+1) bytes, wasting almost half.
- Example: A 33 KB request requires a 64 KB block → 31 KB wasted (48% waste).
The Buddy System is therefore good for coarse-grained, physically contiguous allocation, but bad for small, precise allocations. This leads to the need for a more efficient allocator for kernel objects: Slab Allocation.
10.8.2 Slab Allocation¶
Concept: Dedicated Caches for Kernel Objects¶
Slab allocation is designed for efficient, zero-fragmentation allocation of small, fixed-size kernel data structures (objects). It operates on the principle of object caching.
Key Terms:
- Object: An instance of a specific kernel data structure (e.g., a task_struct, an inode, a semaphore).
- Slab: One or more physically contiguous pages divided into chunks, each chunk sized to hold exactly one object.
- Cache: A collection of one or more slabs, all dedicated to storing objects of the same type and size. There is one cache per unique kernel data structure.
Visual (Go to Figure 10.27): The figure shows separate caches for 3KB objects and 7KB objects. Each cache contains slabs made from contiguous pages, which are partitioned into slots for objects.
How It Works¶
- Cache Creation: When the kernel initializes, it creates a cache for each type of frequently used object (e.g., a task_struct cache, an inode cache). Each cache is initially populated with one or more slabs containing a number of free objects.
- Allocation: When the kernel needs a new object (e.g., to create a new process), it goes to the corresponding cache.
- It first tries to get a free object from a partial slab (a slab with some free, some used objects).
- If no partial slab exists, it takes a free object from an empty slab (all objects free).
- If no empty slab exists, it allocates a new slab from the buddy system (requesting contiguous pages), adds it to the cache, and uses an object from it.
- Object States (Linux Example): A slab can be:
- Full: All objects used.
- Empty: All objects free.
- Partial: Mix of used and free.
- Deallocation: When the kernel is done with an object (e.g., a process terminates), it returns the object to its cache, marking it as free. The memory is not released to the general system; it stays in the cache for immediate reuse.
Two Major Benefits¶
- No Internal Fragmentation: Each cache's slabs are divided into chunks that are the exact size of the object they store. An allocation request returns precisely the needed memory, with zero waste per object. This solves the Buddy System's biggest flaw for small allocations.
- Fast Allocation/Deallocation: Objects are pre-allocated and sitting in the cache.
- Allocation is just marking a free object as used.
- Deallocation is just marking it free again.
- This avoids the overhead of calling the general-purpose allocator (buddy system) for every small request. It's especially efficient for high-frequency allocation/deallocation patterns common in the kernel.
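The mark-used/mark-free behavior can be sketched as a toy object cache. Slot ids stand in for kernel objects; a real slab allocator also tracks full/partial/empty slab lists and obtains its pages from the buddy system.

```python
class SlabCache:
    """Toy object cache: slabs of fixed-size slots. Allocation = take a
    free slot; deallocation = return the slot. Memory stays cached."""
    def __init__(self, name, objects_per_slab):
        self.name = name
        self.objects_per_slab = objects_per_slab
        self.slabs = []        # each slab: list of currently free slot ids
        self.next_slot = 0

    def alloc(self):
        for slab in self.slabs:            # prefer a slab with free objects
            if slab:
                return slab.pop()
        # No free object anywhere: grab a fresh slab (in a real kernel,
        # contiguous pages would come from the buddy system here).
        new_slab = [self.next_slot + i for i in range(self.objects_per_slab)]
        self.next_slot += self.objects_per_slab
        self.slabs.append(new_slab)
        return new_slab.pop()

    def free(self, obj):
        # Return the object to its slab for immediate reuse.
        self.slabs[obj // self.objects_per_slab].append(obj)

cache = SlabCache("task_struct", objects_per_slab=4)
a = cache.alloc()
b = cache.alloc()
cache.free(a)
print(cache.alloc() == a)   # True: the freed object is reused immediately
```

Both operations are just list pushes and pops on pre-sized slots, which is why slab allocation is fast and fragmentation-free for fixed-size objects.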
Historical Context and Linux Variants¶
- Origin: First appeared in Solaris 2.4.
- Linux Adoption: Originally used the Buddy System. Since kernel 2.2, adopted a slab allocator called SLAB.
- Modern Linux Evolution: Linux now has three slab-like allocators:
- SLAB: The original, full-featured implementation. Maintains per-CPU queues for objects to reduce lock contention, but has significant metadata overhead.
- SLOB (Simple List Of Blocks): Designed for embedded systems with very limited memory.
- Maintains three simple linked lists: small (< 256 bytes), medium (< 1024 bytes), and large (< page size).
- Uses a first-fit policy on the appropriate list.
- Minimal overhead, but less efficient for complex systems.
- SLUB (the Unqueued Allocator): Since kernel 2.6.24, the default allocator.
- Reduces metadata overhead compared to SLAB. Stores management data in the kernel's standard page structure, not in separate slabs.
- Removes per-CPU queues, freeing significant memory on large multi-processor systems.
- Provides better performance and scalability on modern multi-core systems.
- Retains the core slab benefits (object caching, no fragmentation).
In summary: The Buddy System provides physically contiguous memory in coarse chunks. The Slab Allocator (and its variants) builds on top of it to provide fast, zero-fragmentation allocation of fixed-size objects for the kernel's internal data structures. They are complementary components of the kernel's memory management.
10.9 Other Considerations¶
10.9.1 Prepaging¶
The Problem with Pure Demand Paging: Cold Start Penalty¶
In a pure demand-paging system, a process (or a resumed process) starts with zero pages in memory. Every page in its initial working set must be faulted in one by one, causing a burst of page faults at startup, leading to noticeable lag.
Prepaging is a proactive strategy to reduce these initial faults by loading pages before they are actually referenced (on speculation).
How Prepaging Can Work¶
- Working-Set Based: When a suspended process is about to be resumed, the OS could preload its entire remembered working set (the set of pages it was actively using before suspension) back into memory before restarting its execution.
- File-Based (Read-ahead): For sequential file access, the OS can predict that after reading block N, the process will likely need block N+1. It can prefetch the next block(s) from disk into the page cache while the process is working on the current block. This is common in file systems (e.g., Linux's readahead() system call).
The Critical Trade-off: Cost vs. Benefit¶
Prepaging is not free. It consumes I/O bandwidth and memory frames.
- Let s = the number of pages prepaged.
- Let α = the fraction of prepaged pages that are actually used (0 ≤ α ≤ 1).
- s * α pages are useful prepages (they prevent a future page fault).
- s * (1 - α) pages are wasted prepages (loaded but never used).
Prepaging is beneficial only if:
(Cost of s * α saved page faults) > (Cost of prepaging s * (1 - α) unnecessary pages)
- If α is close to 1: Most prepaged pages are used → big win. The I/O cost of prepaging is less than the total cost of servicing many individual page faults later.
- If α is close to 0: Most prepaged pages are wasted → big loss. You've incurred disk I/O and used up memory frames for nothing.
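The cost-benefit condition can be checked numerically. The cost values below are illustrative latency units, not measurements.

```python
def prepaging_pays_off(s, alpha, fault_cost, prepage_cost):
    """Prepaging s pages is worthwhile when the faults it saves (s*alpha
    of them) cost more than loading the s*(1-alpha) pages that go unused.
    Costs are illustrative latency units (assumptions, not measured)."""
    saved = s * alpha * fault_cost             # faults avoided later
    wasted = s * (1 - alpha) * prepage_cost    # useless I/O done up front
    return saved > wasted

# With a fault costing about as much as a prepage, everything hinges on alpha:
print(prepaging_pays_off(s=100, alpha=0.9, fault_cost=8.0, prepage_cost=8.0))  # True
print(prepaging_pays_off(s=100, alpha=0.1, fault_cost=8.0, prepage_cost=8.0))  # False
```

When fault and prepage costs are comparable, the break-even point is α = 0.5; cheaper prepaging (e.g., batched sequential reads) pushes the break-even α lower, which is why read-ahead pays off so reliably.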
Practical Challenges¶
- Predicting the Future: It's hard to know exactly which pages a program will need next, especially for arbitrary code execution. Predictions are easier for sequential file access.
- Overcommitting Memory: Prepaging consumes frames that could be used for other processes. If predictions are wrong, it can actually cause paging in other processes.
Conclusion: Prepaging is a high-risk, high-reward optimization. It is most reliably applied in predictable scenarios like file read-ahead or restoring a known working set. For general program execution, modern systems rely more on large physical memory and good caching to mitigate the cold-start penalty, rather than aggressive speculative prepaging.
10.9.2 Page Size¶
The Fundamental Trade-offs¶
Choosing a page size is a critical architectural decision with no single optimal answer. It involves balancing multiple, often conflicting, factors. Page sizes are typically powers of 2, ranging historically from 4KB to several megabytes.
Let's analyze the trade-offs:
Arguments for a LARGER Page Size¶
Smaller Page Tables (Reduced Memory Overhead):
- For a fixed virtual address space size, larger pages mean fewer total pages.
- Example: A 4MB (2^22 bytes) virtual address space.
- 1KB pages: 4,096 pages → large page table.
- 8KB pages: 512 pages → much smaller page table.
- Since each process needs its own page table, smaller page tables save significant kernel memory.
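The page-count arithmetic from the example above can be reproduced directly:

```python
# Page-table entry count for a fixed virtual address space shrinks in
# proportion to page size (the 4 MB example from the text).
VAS = 4 * 1024 * 1024            # 4 MB = 2^22 bytes

for page_size in (1 * 1024, 8 * 1024):
    pages = VAS // page_size
    print(f"{page_size // 1024} KB pages -> {pages} page-table entries")
# 1 KB pages -> 4096 entries; 8 KB pages -> 512 entries
```

Since this table exists per process, an 8x reduction in entries multiplies across every process in the system.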
Reduced Page Fault Overhead (Fewer Faults):
- Larger pages bring more data into memory per fault.
- Example: A 200KB process that actively uses 100KB.
- With a 200KB page: 1 page fault loads everything (used + unused).
- With 1KB pages: Up to 100 page faults just for the used data.
- Each page fault has high fixed software overhead (interrupt handling, table updates). Fewer faults reduce this overhead.
Improved I/O Efficiency for Large Transfers:
- Disk I/O time = Seek Time + Rotational Latency + Transfer Time.
- Seek and Latency dominate (~8 ms), Transfer time is minimal (~0.01 ms for 512 bytes).
- Reading one 16KB page takes only marginally longer than reading one 4KB page (e.g., 8.02 ms vs. 8.01 ms), but transfers 4x more data per I/O operation.
- Amortizes the high seek/latency cost over more data.
Arguments for a SMALLER Page Size¶
Less Internal Fragmentation (Better Memory Utilization):
- Memory is allocated in whole pages. A process rarely ends exactly on a page boundary.
- On average, half of the last page of each process is wasted.
- Waste per process: Page Size / 2.
- 4KB page → ~2KB average waste.
- 2MB page → ~1MB average waste! (Huge problem).
- Smaller pages minimize this per-process waste.
Better Resolution / Matches Locality (Reduces I/O Volume):
- Smaller pages allow the OS to bring in only the data that is actually needed, precisely matching the program's locality.
- Example: A process only uses 100KB scattered across a 200KB address space.
- With 4KB pages: Might need to load ~25 pages (100KB), leaving the other 25 pages on disk.
- With 200KB page: Forced to load all 200KB, including 100KB of unused data. This wastes I/O bandwidth and memory frames.
- Smaller pages reduce unnecessary I/O and allow more efficient use of physical RAM.
Finer-Grained Protection and Sharing:
- Permissions (read, write, execute) are set at the page level.
- Smaller pages allow more precise protection of memory regions (e.g., a small guard page between stacks).
- Allows sharing of smaller regions between processes.
The Historical Trend¶
- 1980s-1990s: The 4KB page became a de facto standard, representing a compromise between the factors above.
- Modern Systems: Trend is toward supporting multiple page sizes.
- Why? Different workloads benefit from different sizes.
- Large Pages (e.g., 2MB, 1GB): Used for big, contiguous data sets (scientific computing, databases) to reduce TLB misses and page fault counts.
- Small Pages (4KB): Remain the default for general-purpose code, minimizing fragmentation.
- Mobile/Embedded: Often use larger base pages (e.g., 16KB or 64KB rather than 4KB) to reduce TLB pressure and page-management overhead, since these systems must get the most out of constrained hardware.
Conclusion: The choice is a classic engineering compromise. The modern solution is not to choose one, but to support multiple page sizes (via mechanisms like huge pages or superpages), allowing the OS and applications to select the appropriate granularity for different parts of the address space. The next section explores this.
10.9.3 TLB Reach¶
The Problem: TLB as a Performance Bottleneck¶
The Translation Lookaside Buffer (TLB) is a small, fast hardware cache for page table entries. Its performance is critical because every memory access requires an address translation.
- TLB Hit Ratio: The percentage of translations found in the TLB. A low hit ratio means frequent, slow walks of the full page table in memory.
- Limitation: TLBs are small (often 64-1024 entries) due to cost and power constraints. If a process's working set (the set of pages it's actively using) doesn't fit in the TLB, performance plummets due to constant TLB misses.
TLB Reach: A Key Metric¶
- Definition: TLB Reach = (Number of TLB Entries) × (Page Size)
- Interpretation: The total amount of physical memory that can be simultaneously mapped by the TLB without a miss. It's the "coverage" of the TLB.
- Goal: The TLB reach should be at least as large as the process's working set. If not, the process will thrash the TLB just as it can thrash pages in memory.
Strategies to Increase TLB Reach¶
1. Increase the Number of TLB Entries:
- Effect: Directly increases reach (Reach ∝ Entries).
- Drawback: Hardware cost and power consumption increase significantly (TLB uses expensive associative memory).
2. Increase the Page Size:
- Effect: Dramatically increases reach (Reach ∝ Page Size). Doubling page size doubles reach with the same number of TLB entries.
- Example: 64-entry TLB with 4KB pages → Reach = 256KB. With 2MB pages → Reach = 128MB.
- Drawback: Leads to increased internal fragmentation (as discussed in 10.9.2). Using a single, large page size is a blunt instrument.
3. Support Multiple Page Sizes (The Modern Solution): This is the most flexible approach. The OS and hardware collaborate to use large pages ("huge pages") for big, contiguous data regions, while keeping small pages for general code and data.
- How it works: The TLB entry includes a field to specify the page size it maps (e.g., 4KB, 2MB, 1GB). The OS can choose the appropriate size when setting up mappings.
- Example - Linux Huge Pages: The default is 4KB, but an application (like a database) can request a portion of its address space to be backed by 2MB or 1GB huge pages. One TLB entry then covers a massive region, eliminating thousands of potential TLB misses.
Advanced Hardware Support: ARMv8 Contiguous Bit¶
The ARMv8 architecture provides a clever optimization for increasing effective TLB reach without changing the fundamental page size used by the OS.
- Contiguous Bit: A special bit in a TLB entry.
- When Set: Indicates that this single TLB entry maps not one, but a contiguous block of adjacent base-sized pages.
- Example Configurations from the text:
- A TLB entry can map 16 contiguous 4KB pages → acts like a 64KB page.
- Or 32 contiguous 32MB blocks → acts like a 1GB region.
- Benefit: The OS can tell the hardware that a certain range of virtual addresses maps to a contiguous physical range. The hardware can then use a single TLB entry for the entire block, massively boosting effective reach. The OS still manages memory in the base page size (e.g., 4KB) for fine-grained control, but the TLB sees large blocks.
Software-Managed TLBs and the Cost¶
Some architectures (like MIPS, SPARC) have software-managed TLBs. On a TLB miss, a hardware trap occurs, and an OS handler must walk the page table and load the correct entry into the TLB.
- Cost: This is slower than a hardware page-table walker (like in x86).
- Benefit for Multiple Sizes: It gives the OS maximum flexibility in choosing page sizes and managing TLB contents, as the management policy is in software.
- Trade-off: The performance overhead of software management is often offset by the dramatically higher TLB hit ratio achieved through smart use of multiple page sizes and contiguity optimizations.
Conclusion: Increasing TLB reach is essential for performance in memory-intensive applications. The modern strategy is not just bigger TLBs, but smarter use of larger and/or contiguous page mappings through OS–hardware cooperation, allowing a small TLB to cover a very large working set effectively.
10.9.4 Inverted Page Tables¶
Recap: The Goal of Inverted Page Tables¶
As introduced in Section 9.4.3, the standard per-process page table has one entry per virtual page, which becomes huge for large address spaces (e.g., 2^20 entries for 32-bit with 4KB pages).
- Inverted Page Table (IPT) Goal: Drastically reduce the memory overhead of translation tables. It does this by having one entry per physical frame, not per virtual page.
- Structure: Each entry in the IPT contains: <Process-ID (PID), Virtual Page Number (VPN)>. It answers the question: "Which process, and which of its virtual pages, is currently occupying this physical frame?"
The Problem with Demand Paging and Inverted Tables¶
While the IPT efficiently tracks what's in memory, it loses vital information needed by demand paging:
- Missing Information: The IPT does not store the location on disk (backing store address) for pages that are not currently in memory. It only knows about pages that are currently resident.
- Consequence: When a page fault occurs for a non-resident page, the OS cannot use the IPT to find where on disk to fetch that page from.
The Required Solution: External Page Tables (EPTs)¶
To handle page faults, the system must maintain additional, per-process data structures alongside the IPT.
- External Page Table (EPT): One per process. It functions like a traditional forward page table, mapping each virtual page number to its backing store location (disk address). It may also contain other flags (valid, dirty, etc.).
- Role: The EPT is only consulted during a page fault to locate the missing page on disk. It is not used for normal address translation (that's the IPT's job).
A Major Complication: Paging the Page Tables¶
Since EPTs can be large (one entry per virtual page), they are too big to keep permanently in kernel memory, especially with many processes.
- Solution: The OS pages the EPTs themselves. They are stored in virtual memory and can be swapped out.
- The Nightmare Scenario (Double Page Fault):
- Process A accesses virtual page X → Page fault (X not in IPT).
- OS page fault handler needs to consult Process A's External Page Table (EPT) to find X's disk location.
- Oops! The needed portion of Process A's EPT is itself paged out to disk.
- This triggers a second page fault to bring in the EPT page itself.
- Result: A single user-level page fault can lead to two disk I/O operations: one to fetch the EPT page, then another to fetch the actual user page. This causes significant delay and complexity.
Kernel Handling and Impact¶
- Careful Kernel Design: The OS must handle this recursive fault scenario very carefully to avoid infinite loops. Typically, the kernel pins (keeps resident) a minimal set of its own data structures, including the page of the IPT and the EPT that describes the currently running fault handler.
- Performance Trade-off: The memory savings of the inverted page table (a single, global table) come at the cost of increased complexity and the potential for higher latency on page faults due to the need to possibly fault in an EPT.
- Use Case: Inverted page tables are most beneficial in systems where physical memory is very scarce and expensive relative to address space size, and where the performance penalty of complex fault handling is acceptable (e.g., some embedded or high-end server systems). Modern 64-bit systems often use other schemes (like multi-level page tables) that are less complex for demand paging.
10.9.5 Program Structure¶
The Transparency Illusion and Programmer/Compiler Awareness¶
While demand paging aims to be transparent, its performance impact is heavily influenced by how a program accesses memory. Programmers and compilers that understand paging can write cache-friendly and page-friendly code for significant speedups.
A Classic Example: Array Traversal Order¶
This classic example shows how simply reversing the loop order can mean the difference between 16,384 page faults and 128.
Assumptions:
- Page size = 128 words (integers).
- Array `data[128][128]` is stored in row-major order (the C/C++ default). This means `data[0][0]` to `data[0][127]` occupy contiguous memory, filling Page 0; `data[1][0]` to `data[1][127]` fill Page 1; and so on.
- The OS allocates fewer than 128 frames to the program (very likely).
Code 1: Column-Major Inner Loop (DISASTER)

```c
for (j = 0; j < 128; j++)
    for (i = 0; i < 128; i++)
        data[i][j] = 0;  /* BAD: walks down column j, touching a new page on every access */
```
- Access Pattern: `data[0][0]` (Page 0), `data[1][0]` (Page 1), ..., `data[127][0]` (Page 127). This touches a different page on every single access.
- With <128 frames, after 127 iterations all frames are full of pages from column 0. The next access (`data[0][1]`) needs Page 0 again, but it was likely evicted: page fault.
- Result: Every single array access (16,384 total) causes a page fault because the loop strides across pages, exceeding the available frames.
Code 2: Row-Major Inner Loop (OPTIMAL)

```c
for (i = 0; i < 128; i++)
    for (j = 0; j < 128; j++)
        data[i][j] = 0;  /* GOOD: finishes all of row i (Page i) before moving on */
```
- Access Pattern: All 128 elements of `data[0][j]` (Page 0), then all 128 elements of `data[1][j]` (Page 1), and so on.
- Page Faults: Only one fault per page (128 total). Once Page 0 is loaded, all 128 accesses to it hit; it can then be evicted to make room for Page 1, and so on. Perfect locality within each page.
General Principles for Good Locality¶
Data Structure Choice:
- Good Locality: Stacks, sequential arrays (accessed linearly). They concentrate references in few pages.
- Bad Locality: Hash tables, linked lists with scattered nodes. They intentionally randomize or scatter accesses, causing many page touches.
- Trade-off: You must balance locality against the data structure's primary purpose (e.g., fast lookup for hash tables).
Compiler and Loader Optimizations:
- Separate Code & Data: Mark code pages as read-only. Clean pages never need to be written back to disk on eviction, saving I/O time.
- Reentrant Code: Allows multiple processes to share the same physical code pages (e.g., shared libraries), reducing total memory footprint.
- Page Boundary Alignment: The loader can try to place entire functions/routines within a single page to minimize intra-procedure page faults.
- Packing Related Code: Functions that call each other frequently (working set grouping) can be packed into the same page to reduce TLB misses and faults during execution flow.
- The Bin-Packing Problem: The loader faces the challenge of placing variable-sized code and data segments into fixed-size pages to minimize cross-page references. This is especially impactful with large page sizes, where poor packing wastes more space and causes more unnecessary page inclusions.
Key Takeaway¶
Memory access pattern is performance. Even with ample RAM, poor locality increases cache misses (CPU cache), TLB misses, and potential page faults. Writing spatially local code—accessing memory addresses close together before moving far away—is a fundamental optimization that benefits all levels of the memory hierarchy, from L1 cache to virtual memory.
10.9.6 I/O Interlock and Page Locking¶
The Problem: I/O to a Page That Gets Evicted¶
This addresses a critical race condition between page replacement and Direct Memory Access (DMA) I/O.
The Dangerous Sequence:
- Process A issues a read() system call to read disk data into its memory buffer (a user-space address).
- The OS sets up the DMA transfer: tells the disk controller the physical address of the buffer page and starts the I/O. Process A is then put to sleep waiting for I/O completion.
- The CPU switches to Process B.
- Process B causes page faults. Using a global replacement algorithm, it selects a victim frame. Unfortunately, it picks the exact frame that contains Process A's I/O buffer, which is currently in the middle of a DMA transfer. The OS pages it out.
- The disk controller finishes the DMA transfer, writing data directly to the original physical frame. But that frame now holds a completely different page belonging to Process B (or someone else).
- Result: Data corruption. Process A's buffer contains wrong data, and Process B's page is silently overwritten.
This is why frames involved in pending I/O must be locked in memory (pinned).
Solutions¶
1. Copying via System Buffers (No Direct I/O):
- Method: Never allow DMA directly to/from user-space buffers. Instead:
- For a read: Disk → Kernel system buffer (always resident) → copy to user buffer.
- For a write: User buffer → copy to Kernel system buffer → Disk.
- Advantage: Eliminates the race condition; kernel buffers are pinned.
- Disadvantage: Extra memory copy on every I/O operation, causing significant CPU overhead and latency. Often unacceptable for high-performance I/O.
2. Page Locking (Pinning) - The Common Solution:
- Mechanism: Each frame has a lock (or pin) bit. If set, the frame cannot be selected for replacement.
- Procedure for DMA I/O:
- Before starting the DMA transfer to/from a user page, the OS locks (pins) the corresponding frame(s) in memory.
- The I/O proceeds directly to/from the user buffer.
- Upon I/O completion interrupt, the OS unlocks the frame(s).
- This is depicted in Figure 10.28: The buffer frame must be in memory and locked for the duration of the disk I/O.
Other Uses of Page Locking¶
- Kernel Code and Critical Data: The core kernel, and especially the memory management code itself, is often locked in memory to prevent a page fault within the page fault handler (which would cause a deadly recursive fault).
- Real-Time and Database Applications: Applications (e.g., databases) that manage their own disk caching may pin their buffer pool in memory to guarantee performance and control, bypassing the OS page cache. This requires special privileges.
- Preventing Immediate Re-Eviction (The "Second-Chance" for New Pages):
- Scenario: A low-priority process faults, a page is brought in for it. Before it gets CPU time to use it, a high-priority process faults and wants a frame. The freshly loaded (but unused) page from the low-priority process looks like a perfect clean victim.
- Policy Decision: Should we allow this? It wastes the I/O done for the low-priority process.
- Solution using Lock Bit: When a page is faulted in, lock it temporarily. The lock is released only after the faulting process has been scheduled and had a chance to use the page at least once. This gives the page a "fair chance" to be useful.
Dangers and Safeguards¶
- Lock Bit Leak: If a bug causes a lock bit to be set but never cleared, that frame becomes permanently unusable, effectively reducing available memory. This is a serious kernel bug.
- Resource Exhaustion: A malicious or buggy process could lock all of memory.
- OS Safeguards: Modern OSes impose limits and can override lock hints under extreme memory pressure.
- Example - Solaris: Allows locking hints but is free to disregard them if the free-frame pool becomes too small or if a process requests excessive locking.
10.10 Operating-System Examples¶
10.10.1 Linux¶
Linux uses a combination of demand paging and a global page-replacement policy that approximates LRU using a two-list (active/inactive) clock-like algorithm.
Core Data Structures: Active and Inactive Lists¶
To manage page aging and reclamation, Linux maintains two primary page lists:
- Active List: Contains pages that are currently considered "in use" or "hot" (part of a process's working set).
- Inactive List: Contains pages that are candidates for reclamation (considered "cold" or not recently used). These pages are eligible to be freed if memory is needed.
The Page Lifecycle Algorithm (Follow Figure 10.29)¶
This is a refined clock algorithm with two hands (lists).
Initial Allocation & Reference:
- When a page is first allocated (on a page fault), its accessed (reference) bit is set, and it is placed at the rear (tail) of the Active List. It is considered "new and hot".
Active List Management:
- While on the Active List, if the page is referenced again, its accessed bit is set and it is moved to the rear of the Active List (promoted, kept hot).
- Periodically, a kernel daemon (`kswapd`) scans the Active List and clears the accessed bits of pages as it passes. This is the "clock hand" movement.
- A page that survives a scan without its accessed bit being set again slowly moves toward the front of the Active List as newer pages are added to the rear. It becomes a candidate for demotion.
Demotion to Inactive List:
- To keep the lists balanced, when the Active List grows too large, pages from the front of the Active List (the oldest, least recently accessed) are moved to the rear of the Inactive List.
- Once in the Inactive List, a page is in the "danger zone" but can still be rescued.
Rescue from Inactive List:
- If a page in the Inactive List is referenced (accessed bit set), it is promoted back to the rear of the Active List.
- This prevents frequently used pages from being evicted even if they briefly looked cold.
Reclamation from Inactive List:
- When the page-out daemon (`kswapd`) runs, it primarily scans the Inactive List for victims.
- It checks the accessed bit and dirty (modified) bit of pages in the Inactive List:
- If accessed bit = 1: Page was recently used → promote back to Active List.
- If accessed bit = 0: Page is truly cold.
- If clean (not dirty): Can be immediately reclaimed (added to free list).
- If dirty: Must be scheduled for write-back to disk before its frame can be freed.
The Role of kswapd (Page-Out Daemon)¶
- Trigger: `kswapd` wakes up periodically, or when free memory falls below a threshold (similar to the reaper in Section 10.5.3).
- Action: It performs the scanning and reclamation process described above, moving pages between lists and freeing frames, aiming to keep free memory above a high-water mark.
- Global Replacement: Since `kswapd` operates on a global set of pages (from all processes), Linux uses a global replacement policy. This allows memory to flow dynamically to the processes that need it most.
This two-list strategy is a sophisticated LRU approximation. The Inactive List acts as a "grace period" or probationary zone before final eviction, allowing recently used pages a chance to prove they are still active, thereby improving accuracy over a simple single-clock algorithm.
10.10.2 Windows¶
Architectural Support¶
Windows 10 is a hybrid OS supporting multiple architectures with vast address spaces:
- 32-bit (x86): 2-3 GB virtual address space per process, up to 4 GB physical RAM.
- 64-bit (x86-64, ARM): 128 TB virtual address space, supports up to 24 TB physical RAM (Windows Server supports up to 128 TB).
- Features Supported: Demand paging, copy-on-write, shared libraries, memory compression, and clustering.
Key Feature: Paging with Clustering¶
Windows uses an aggressive form of prepaging to exploit spatial locality.
- Clustering: On a page fault, Windows doesn't just load the faulting page. It loads a cluster (block) of pages surrounding the faulting page.
- Cluster Sizes:
- Data Page Fault: Cluster of 3 pages (the faulting page + the page before and after it).
- Other Page Faults (e.g., code): Cluster of 7 pages.
- Rationale: Anticipates that the process will likely need nearby pages soon. This amortizes the disk I/O cost (seek/latency) over multiple pages, reducing future page faults.
Working-Set Management¶
Windows actively manages each process's working set (its resident pages in memory) using soft limits.
- Default Limits (for a new process):
- Working-Set Minimum: 50 pages (guaranteed lower bound if memory is available).
- Working-Set Maximum: 345 pages (soft upper limit).
- Flexible Enforcement: These are guidelines, not hard limits.
- A process can grow beyond its maximum if free memory is plentiful.
- A process can shrink below its minimum under severe memory pressure.
Page Replacement Policy: Hybrid Approach¶
Windows uses a combination of local and global replacement, built around the clock algorithm (LRU approximation).
- Local Policy (Per-Process): When a process at its working-set maximum incurs a page fault and free memory is low, the virtual memory manager uses a local clock algorithm to select a victim page from that process's own working set.
- Free List: The system maintains a list of free page frames. If free memory is above a threshold, page faults are satisfied from this list.
Global Memory Pressure Response: Automatic Working-Set Trimming¶
When system-wide free memory falls below a critical threshold, Windows activates global reclamation.
- Mechanism: The virtual memory manager systematically reduces the size of process working sets to free frames.
- Target Selection: It evaluates all processes. Priority for trimming is given to:
- Larger processes (more pages to take).
- Idle processes (less likely to need their pages immediately).
- Trimming Rules:
- A process is trimmed down towards its working-set minimum.
- If severe pressure continues, trimming can force a process below its minimum.
- Both user and system processes are subject to trimming (though critical kernel pages are locked).
- Goal: Quickly replenish the global free-frame list to a safe level.
Summary: Windows memory management is adaptive and hybrid. It uses local replacement for contained pressure within a process, and aggressive global trimming for system-wide memory crises. The working-set model with soft limits provides a framework, while clustering optimizes for common access patterns. This balances performance for individual applications with overall system stability.
10.10.3 Solaris¶
Core Mechanism: The Two-Handed Clock (pageout)¶
Solaris uses a sophisticated global page-replacement algorithm known as the two-handed clock, a variant of the second-chance algorithm. The goal is to maintain a sufficient pool of free pages.
- Key Parameter: `lotsfree` = the threshold at which active page scanning begins. Typically set to 1/64 of total physical memory.
- Trigger: Four times per second, the kernel checks free memory. If free pages < `lotsfree`, the `pageout` daemon (page scanner) is activated.
The Two-Handed Clock Operation¶
Imagine a clock face with two hands moving in the same direction, with a fixed gap (handspread) between them.
Front Hand (Clearing Hand):
- Scans through all pages in memory.
- For each page, it clears (sets to 0) the reference bit. This action marks the page as "possibly cold" starting now.
Back Hand (Reaping Hand):
- Follows the front hand at a fixed distance (`handspread` pages behind).
- Checks the reference bit of the page the front hand cleared earlier.
- Decision:
- If reference bit = 0: The page was not re-referenced in the interval between the two hands passing. It is truly cold. The back hand reclaims it: if dirty, schedules a write-back; then adds its frame to the free list.
- If reference bit = 1: The page was referenced again after the front hand cleared it. It is still hot. The back hand skips it (gives it a second chance). Its bit remains 1 until the front hand comes around again.
Advantage over Single-Hand Clock: The two-handed clock introduces a measurable "aging" period (handspread / scanrate). A page must survive unreferenced for this entire period before being evicted, making LRU approximation more accurate.
Dynamic Scan Rate (scanrate)¶
The speed at which the clock hands move is not fixed. It adjusts dynamically based on memory pressure.
- `slowscan`: Default = 100 pages/second. Used when memory pressure is just beginning (free memory just below `lotsfree`).
- `fastscan`: Maximum = min(8192, total_physical_pages / 2) pages/second. Used under severe memory pressure.
- Adaptation: As free memory decreases, the `scanrate` increases linearly from `slowscan` to `fastscan` (see Figure 10.30). This makes reclamation more aggressive when needed.
Aging Interval: The time between a page's reference bit being cleared and checked is:
Aging Time = handspread / scanrate.
Example: handspread=1024 pages, scanrate=100 pages/sec → Aging time = ~10 seconds. If scanrate increases to 4000 pages/sec, aging time drops to ~0.25 seconds.
Multi-Tiered Response to Increasing Memory Pressure¶
Solaris has a graduated response system with multiple thresholds (refer to Figure 10.30):
Level 1 - Normal Scanning (Free < `lotsfree`):
- `pageout` runs 4 times per second, at a `scanrate` based on how far below `lotsfree` free memory has fallen.
Level 2 - Aggressive Scanning (Free < `desfree`, "desired free"):
- `pageout` runs 100 times per second to aggressively reclaim pages and keep free memory above `desfree`.
- If it cannot maintain `desfree` over a 30-second average, the kernel escalates.
Level 3 - Swapping:
- The kernel starts swapping out entire idle processes to free all their pages at once. This is a last resort before collapse.
Level 4 - Emergency (Free < `minfree`):
- Every request for a new page (page fault) triggers the `pageout` scanner synchronously. System performance grinds down as every allocation waits for reclamation.
Optimizations and Special Handling¶
- Reclaim from Free List (Minor Fault): If a process faults on a page that was recently placed on the free list but not yet reused, Solaris allows it to reclaim that same page (a minor fault), avoiding disk I/O.
- Shared Library Pages: The scanner skips pages belonging to shared libraries (used by many processes), as evicting them would cause major faults in multiple processes.
- Priority Paging: Distinguishes between process pages and file system page cache pages. Under memory pressure, it preferentially reclaims from the file cache (which can be re-read from disk) before taking pages from processes. This protects application working sets. (Detailed in Section 14.6.2.)
Summary: Solaris uses a dynamic, multi-threshold, two-handed clock algorithm for global page replacement. Its strength lies in adaptive aggressiveness (scanrate), a robust aging mechanism (two hands), and a graduated crisis response from scanning → aggressive scanning → swapping, ensuring system stability under varying loads.
10.11 Summary¶
Core Concept¶
Virtual Memory abstracts physical RAM into a vast, uniform logical address space, providing powerful illusions and capabilities to both programmers and processes.
Key Benefits of Virtual Memory¶
- Larger Than Physical Memory: Programs can have address spaces larger than available RAM.
- Partial Residence: A program does not need to be entirely loaded in memory to execute.
- Memory Sharing: Processes can share code and data (e.g., shared libraries) efficiently.
- Efficient Process Creation: Copy-on-Write (COW) allows fast `fork()` by sharing pages initially and copying only upon modification.
Demand Paging¶
The fundamental technique that enables these benefits.
- Principle: Load a page into physical memory only when it is first accessed (on demand).
- Page Fault: The event that occurs when a needed page is not in RAM. The OS handles it by:
- Finding a free frame (may require page replacement).
- Loading the page from disk.
- Updating the page table.
- Restarting the faulting instruction.
Page Replacement Algorithms¶
When no free frames exist, the OS must select a victim page to evict. Key algorithms:
- FIFO (First-In, First-Out): Simple but suffers from Belady's Anomaly (more frames can cause more faults).
- OPT (Optimal): Replaces the page used farthest in the future. Theoretically best but unimplementable (needs future knowledge). Serves as a benchmark.
- LRU (Least Recently Used): Replaces the page unused for the longest time. Excellent but expensive to implement exactly. It is a stack algorithm (no Belady's Anomaly).
- LRU Approximations (Practical):
- Reference Bit: Tracks whether a page was used recently.
- Second-Chance/Clock Algorithm: FIFO with a second chance for referenced pages.
- Enhanced Second-Chance: Considers both reference (R) and modify (M) bits to minimize I/O.
Scope of Replacement:
- Global Replacement: Victim can be chosen from any process. More common, allows dynamic memory flow.
- Local Replacement: Victim chosen only from the faulting process's own frames. Provides isolation but can be less efficient.
Thrashing and Its Prevention¶
- Thrashing: A pathological state where the system spends more time paging than executing due to severe memory over-commitment.
- Cause: Total working set sizes of all processes exceed available physical memory.
- Prevention Strategies:
- Working-Set Model: Allocate enough frames to hold each process's current locality.
- Page-Fault Frequency (PFF): Dynamically adjust frame allocation to keep a process's page-fault rate within bounds.
- Current Practice: Install sufficient physical RAM to accommodate typical working sets.
Memory Compression¶
An alternative to swapping used prominently in mobile systems (Android, iOS) and modern desktops (Windows, macOS).
- Action: Compresses several in-memory pages into a single frame, freeing frames without disk I/O.
- Trade-off: CPU compression/decompression overhead vs. much faster "paging" within RAM.
Kernel Memory Allocation¶
Kernel memory has special needs (small, variable-sized objects; physical contiguity for DMA). Two specialized allocators:
- Buddy System: Allocates physically contiguous blocks in power-of-2 sizes. Fast coalescing but causes internal fragmentation.
- Slab Allocation: Creates object caches for each kernel data type. Allocates exact-size objects from pre-initialized slabs, achieving zero fragmentation and fast allocation/deallocation. (Linux uses SLAB, SLOB, SLUB variants).
Performance Considerations¶
- TLB Reach: Amount of memory mappable by the TLB (= TLB entries × Page Size). Increased via:
- Larger pages (Huge Pages).
- Multiple page sizes.
- Hardware support for contiguous blocks (ARMv8 contiguous bit).
- Program Structure: Memory access patterns (locality) drastically affect page-fault rates. Row-major vs. column-major array traversal is a classic example.
- I/O Interlock: Pages involved in DMA I/O must be locked (pinned) in memory to prevent data corruption.
Real-World Systems¶
- Linux: Uses a two-list (active/inactive) clock algorithm managed by the `kswapd` daemon.
- Windows: Uses clustering (prepaging), a hybrid local/global clock algorithm, and automatic working-set trimming under memory pressure.
- Solaris: Uses a sophisticated two-handed clock algorithm with a dynamic scan rate and a multi-tiered response (`lotsfree`, `desfree`, `minfree` thresholds) to memory pressure.
Virtual memory is the cornerstone of modern system performance and programmability, seamlessly weaving together hardware support, OS algorithms, and application behavior to create the powerful abstraction of abundant, private, fast memory for every process.
The End¶
This handbook covers key concepts from:
- Silberschatz, A., Galvin, P. B., & Gagne, G. (2021). Operating System Concepts (10th ed.). Wiley.
To fully master operating systems:
- Purchase the official textbook for complete details and exercises
- Practice with real systems (Linux, Windows kernel programming)
- Join OS development communities
- Take formal courses (Coursera, MIT OpenCourseWare, etc.)
Last updated: 2026/2/3
Handbook made by Mani Hosseini (@manih84)